2019 S&P 500 Data Exploration Stage

by Marc Angelo Acebedo

Table of Contents

I) Introduction

  • I kept features in separate CSVs because date formats differ. eps_fc and eps_act

After cleaning the original dataset as documented in my data wrangling process, data_cleaning.ipynb, I isolated the following columns:

features.csv

  • firm_id : foreign key referring to primary keys in firms.csv
  • feature : type of feature that the value field denotes (eps_fc, eps_act, eod_act, eps_fc_terms)
  • date : DateTime object in YYYY-MM-DD format
    • for eod_act, means the date at which the value was recorded
    • for eps_fc_terms, means the date at which the term forecast was made
  • term : period object in YYYYQQ format (the time period when the value was recorded)
    • FISCAL years for eps_fc and eps_act
    • CALENDAR years for eod_act and eps_fc_terms
  • value : displays the EPS or EOD value as specified in the 'feature' column.

avgs.csv

  • firm_id : foreign key referring to the primary keys in firms.csv
  • average : recorded average value as specified in the average_type column
  • average_type : type of average denoted (twenty year, quarterly, or yearly)
  • time_period : time period that the average is recorded:
    • for twenty-year averages, it's NaN
    • for yearly averages, displays the year (YYYY)
    • for quarterly averages, displays the quarter (QQ)
  • feature : type of feature recorded for the average (eps_fc, eps_act, eod_act, eps_fc_terms)

II) Data Setup & Overview

In [2]:
#import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sb
import statsmodels.api as sm
import math 
import random
import calendar

from matplotlib import cm

#rmse calculation
from sklearn.metrics import r2_score, mean_squared_error

%matplotlib inline
plt.style.use('bmh')
In [3]:
#define data directories
PATH_CLEAN = './data/clean/'
PATH_CLEAN_AVGS = './data/clean/averages/'
In [4]:
#define visuals destination
PATH_UNIVARIATE = './visuals/univariate/'
PATH_BIVARIATE = './visuals/bivariate/'
PATH_MULTIVARIATE = './visuals/multivariate/'

Import features and all averages

In [5]:
features = pd.read_csv(PATH_CLEAN + 'features.csv', low_memory = False)
avgs = pd.read_csv(PATH_CLEAN_AVGS + 'avgs.csv')
In [6]:
#import firm_ids for foreign key references
firm_ids = pd.read_csv(PATH_CLEAN + 'firms.csv')

Describe Datasets

In [7]:
#look at 5 random entries
features.sample(5)
Out[7]:
firm_id feature date term value
113840 369 eps_fc NaN 2005Q1 0.715
35672 424 eod_act 2013-03-28 2013Q1 36.560
63366 261 eps_fc_terms 2016-04-01 2016Q3 0.611
23529 280 eod_act 2001-06-29 2001Q2 15.800
68436 325 eps_fc_terms 2003-10-01 2004Q1 NaN
In [8]:
avgs.sample(5)
Out[8]:
firm_id average average_type time_period feature
11743 128 1.70050 yearly 2018 eps_fc
7407 337 0.45100 yearly 2009 eps_fc
50563 63 0.36855 quarterly q2 eps_fc_terms
14564 424 NaN yearly 2002 eps_act
23830 95 NaN yearly 2000 eod_act
In [9]:
print('FEATURES rows, columns = {}'.format(features.shape), '\n',
      'AVERAGES rows, columns = {}'.format(avgs.shape))
FEATURES rows, columns = (167660, 5) 
 AVERAGES rows, columns = (52015, 5)

Convert DateTime columns

In [10]:
features.dtypes
Out[10]:
firm_id      int64
feature     object
date        object
term        object
value      float64
dtype: object
In [11]:
features.date = pd.to_datetime(features.date)
features.term = pd.to_datetime(features.term).dt.to_period('Q')
In [12]:
#verify dtypes
features.dtypes
Out[12]:
firm_id             int64
feature            object
date       datetime64[ns]
term        period[Q-DEC]
value             float64
dtype: object
In [13]:
avgs.dtypes
Out[13]:
firm_id           int64
average         float64
average_type     object
time_period      object
feature          object
dtype: object
In [14]:
features.sample(5)
Out[14]:
firm_id feature date term value
89114 74 eps_fc NaT 2018Q3 0.786
87607 56 eps_fc NaT 2019Q4 0.726
13823 164 eod_act 2010-12-31 2010Q4 45.705
84165 16 eps_fc NaT 1999Q2 0.595
42465 0 eps_fc_terms 2011-01-01 2011Q2 0.588

III) Preliminary Exploration

Structure of the Datasets

Our features.csv dataset contains 167,660 entries with 5 features. Firm ID, feature, date, and term are all categorical variables while the value field is numeric and continuous. Even though the date and term fields are recorded as DateTime and Period objects respectively, they are still discrete, categorical data, because there is a limit to the year that can be recorded (1999 - 2020) and there cannot be more than 4 quarters (Q1 - Q5).

Our avgs.csv dataset contains 52,015 entries with 5 features. Firm ID, average type, time period, and feature are all categorical variables while the average field is numeric and continuous.

Main features(s) of interest in the dataset

I'm interested in seeing the historical correlation of forecasted vs. actual EPS across all firms in the 2019 S&P Index.

  • eps_fc, eps_act are the main variables of interest—they are consistent in measuring both the forecasted and actual EPS of all 505 firms, per fiscal period, over a span of 20 years.
  • eod_act is recorded based on calendar period instead of fiscal period, which is a flaw in the data. However, it can be used for some further exploration.
  • eps_fc_terms is based on calendar period. Although it depicts forecasted EPS, this is still not a main variable of interest because it can extend my main research question, but not fully answer it.

As somebody with very little familiarity with the stock market, I decided to dedicate this thesis project to my exploration of a field I am not familiar with.. I read the book "A Beginner's Guide to the Stock Market" by Matthew R. Kratter, which piqued my interest in stock market trading—particularly dividend stocks. I decided that if I were going to educate myself further on the stock market, then this thesis project to end my senior year at NCF would be a perfect opportunity to directly explore this new interest. Not only would I be educating myself on how the stock market works, but I would also be working directly with stock market data, which will help me build further intuition in future stock market and finance-related projects.

These are the questions I'd like to pose in my exploratory stage:

  1. Does average EPS prediction error depict any differences in trends whether by a yearly, quarterly, or full-term basis?

  2. I generate “naive EPS forecasts” by calculating the rolling mean of actual EPS from the past 2 quarters. How do my forecasted EPS forecasts compare to Bloomberg forecasts?

  3. What differences and similarities emerge when analyzing the prediction error and percentage error of EPS forecasts?

  4. Does statistical significance stay true among the relationships between actual EPS and forecasted EPS, regardless of forecasted type (raw data along with all yearly, quarterly, and twenty-year averages)?

  5. How do EOD prices trend across all firms from 1999 - 2019?

Features in the dataset that will support my investigation into the features of interest

For the broadest overview, I predict that overtime, EPS forecasts will continually become optimistic for those firms that consistently have high actual EPS values. Vice versa, overtime, EPS forecasts will continually become pessimistic for those firms that consistently have lower-than-expected EPS values. As for the other factors, I expect yearly values to show more consistency in pattern (since it is more intuitive to measure 20 years) than quarterly values (since economic situations are greatly diverse over the period of 20 years, no matter the period).

IV) Exploration

A) Univariate

MISSING VALUES (Features)

In [15]:
sb.set(style = "darkgrid")
In [16]:
def generate_missing_total(df, title_name, save_path, csv_name):
    plt.figure(figsize = [10, 5])
    plt.title('Missing Values per Column under ' + title_name, size = 20)
    na_counts = df.isna().sum().sort_values(ascending = True)
    
    na_counts.plot.barh(x = na_counts.values, y = na_counts.index);
    plt.xlabel('Count', size = 10)
    plt.ylabel('Column Name', size = 10)
    plt.savefig(save_path + csv_name)
In [17]:
features.isna().any()
Out[17]:
firm_id    False
feature    False
date        True
term       False
value       True
dtype: bool
In [18]:
generate_missing_total(features, '"Features"', PATH_UNIVARIATE, 'features-missing-total.png')

Observation 1: date is the field with the highest amount of missing data, with around 85,000 missing entries.

Observation 2: value contains around 26,000 missing entries.*

Observation 3: The gap in number of missing values between value and date is noticeably large.

  • This makes sense because date is automatically set to 'NaT' for eps_fc and eps_act.

Questions

1) For all the entries where date is not NaT, how large is the gap in missing values between values without the dropped dates and values with the dropped dates?

In [19]:
features_date_dropna = features[features['date'].notna()]
In [20]:
plt.figure(figsize = [10, 5])
plt.suptitle('Missing Values per Column under "Features"', size = 20)
plt.title('after dropping empty dates', size = 15)
na_counts = features_date_dropna.isna().sum().sort_values(ascending = True)

fig = na_counts.plot.barh(x = na_counts.values, y = na_counts.index);
plt.constrained_layout = True
plt.xlabel('Count', size = 10)
plt.ylabel('Column Name', size = 10)
plt.savefig(PATH_UNIVARIATE + 'features-missing-date-dropna.png')

Question 1: For all the entries where date is not NaT, how large is the gap in missing values between values without the dropped dates and values with the dropped dates?

Answer

Observation 1: Keeping NaT dates, the amount of overall missing values is around 26,000. After dropping all the NaT dates (dropping eod_actand eps_fc_terms), the amount of missing values dropped to around 14,000.

Observation 2: Drawing from the previous observation, this means that eps_fc and eps_act both have around 12,000 missing values in total.

Observation 3: Effectively, the undropped columns, eod_act and eps_fc_terms, have around 14,000 missing values in total.


Questions:

2) How many missing values does each feature have, individually?

In [21]:
#turn feature values into index values
features_num_nan = features[['feature', 'value']].groupby('feature').count()
In [22]:
#count number of total values (NaN or not) per feature
feature_counts = features.feature.value_counts()
In [23]:
#create function to calculate number NaN
def calculate_nan(num_nan, counts, feature_name):
    num_nan.loc[feature_name] = counts[feature_name] - num_nan.loc[feature_name]
    #num NaN (feature) = num all values (NaN or not) - num all values (not NaN)
    
#create array of feature names
feature_names = ['eps_fc', 'eps_act', 'eod_act', 'eps_fc_terms']
In [24]:
#update all nan counts per average type
for feature_name in feature_names:
    calculate_nan(features_num_nan, feature_counts, feature_name)
In [25]:
plt.figure(figsize = [10, 5])
sb.set(style = 'whitegrid')
plt.title('Missing Values per Feature under "Features"', size = 20)
plt.xlabel('Feature')
plt.ylabel('Count')

sb.barplot(x = features_num_nan.index, y = features_num_nan.value, color = "pink")
plt.savefig(PATH_UNIVARIATE + 'features-missing-per-feature.png')

Observation 1: The feature with the most amount of missing values is eps_fc_terms.

Observation 2: The feature with the least amount of missing values is eps_act.

Observation 3: eod_act, eps_fc, and eps_fc_terms all have a very small gap in missing data comparedto eps_act

Question 2: How many missing values does each feature have, individually?

Answer:

Observation 1: Below is a list of each feature and their corresponding estimated number of missing values:

  • eod_act : 7900
  • eps_act : 5500
  • eps_fc : 7050
  • eps_fc_terms : 7100

Observation 2: eps_fc and eps_act have around 12,000 missing values total, which is consistent with our previous observation.

Observation 3: eod_act and eps_fc_terms have around 14,000 missing values total, which is also consistent with our previous observation.

MISSING VALUES (Averages)

In [26]:
avgs.isna().any()
Out[26]:
firm_id         False
average          True
average_type    False
time_period      True
feature         False
dtype: bool
In [27]:
#averages
generate_missing_total(avgs, 'Averages', PATH_UNIVARIATE, 'avgs-missing-total.png')

Observation 1: The average column contains 6,000 missing data entries.

Observation 2: The time_period column contains exactly 2,000 missing data entries.

Observation 3: The gap in missing values between average and time_period is 4,000.

Observation 3: By default, all entries with average_type of twenty_year should have NaT fields for time_period.


Questions:

3) After dropping all NaT fields under time_period, how large will the gap in missing values be between values with the NaT dates and values after dropping the NaT dates?

4) How many missing average values does each average_type contain?

In [28]:
avgs_date_dropna = avgs[avgs['time_period'].notna()]
In [29]:
plt.figure(figsize = [10, 5])
plt.suptitle('Missing Values per Column under "Averages"', size = 20)
plt.title('after dropping empty time periods', size = 15)
na_counts = avgs_date_dropna.isna().sum().sort_values(ascending = True)

fig = na_counts.plot.barh(x = na_counts.values, y = na_counts.index);
plt.constrained_layout = True
plt.xlabel('Count', size = 10)
plt.ylabel('Column Name', size = 10)
plt.savefig(PATH_UNIVARIATE + 'avgs-missing-date-dropna.png')

Question 3: After dropping all NaT fields under time_period, how large will the gap in missing values be between values with the NaT dates and values after dropping the NaT dates?

Answer:

Observation 1: After dropping all NaT dates under time_period, there are still 6,000 missing averages. This is number is consistent with the number of missing averages even before dropping NaT dates. Thus, all entries with a missing time period (aka all twenty_year average types) did not contain any empty data.

To summarize the above, the number of missing average values remains the same whether or not you drop all entries with empty time periods, aka all twenty_year average types.

In [30]:
#turn avg values into index values
avgs_num_nan = avgs[['feature', 'average']].groupby('feature').count()
In [31]:
#count number of total values (NaN or not) per feature
avgs_counts = avgs.feature.value_counts()
In [32]:
#update number of missing values for each average type
for feature_name in feature_names:
    calculate_nan(avgs_num_nan, avgs_counts, feature_name)
In [33]:
avgs_num_nan
Out[33]:
average
feature
eod_act 1539
eps_act 1255
eps_fc 1655
eps_fc_terms 1562
In [34]:
plt.figure(figsize = [10, 5])
# sb.set(style = 'whitegrid')
plt.title('Missing Values per Feature under "Averages"', size = 20)
plt.xlabel('Feature')
plt.ylabel('Count')

sb.barplot(x = avgs_num_nan.index, y = avgs_num_nan.average, color = "pink")
plt.savefig(PATH_UNIVARIATE + 'avgs-missing-per-feature.png')

Observation 1: eps_fc has the most missing values at around 1650.

Observation 2: eps_act has the least amount of missing values at around 1250.

Observation 3: Under "Averages", the gap in missing values between each average type is larger than the missing values under "Features".

Answer:

Observation 1: Each average type has the following approximate number of missing averages:

  • eod_act : 1550
  • eps_act : 1250
  • eps_fc : 1650
  • eps_fc_terms : 1590
In [35]:
#turn avg values into index values
avgs_num_avg = avgs[['average_type', 'average']].groupby('average_type').count()
In [36]:
plt.figure(figsize = [10, 5])
# sb.set(style = 'whitegrid')
plt.title('Missing Values per Average Type under "Averages"', size = 20)
plt.xlabel('Average Type')
plt.ylabel('Count')

sb.barplot(x = avgs_num_avg.index, y = avgs_num_avg.average, color = "pink")
plt.savefig(PATH_UNIVARIATE + 'avgs-missing-per-avg-type.png')

Question 4: How many missing average values does each average_type contain?

Answer:

Observation 1:

  • quarterly: ~7500
  • twenty_year: ~2500
  • yearly: ~3600

Observation 2: The gap in missing values is greatly diversified across all average types.

FIRM_ID (Features)

Since 505 firms is too much to fit into one single visual, I decided to split them apart by focusing on the 20 most common firm ids and the 20 rarest firm ids.

In [37]:
def generate_pct_bar(df, cat_var, color):
    cat_counts = df[cat_var].value_counts()
    ax = sb.countplot(data = df, y = cat_var, order = cat_counts.index, palette = color)
    
    n_points = df.shape[0]
    locs, labels = plt.yticks()
    
    for p in ax.patches:
        percentage = '{:0.1f}%'.format(100 * p.get_width()/n_points)
        x = p.get_x() + p.get_width() + 0.02
        y = p.get_y() + p.get_height()/2
        ax.annotate(percentage, (x, y), size = 20)
In [38]:
features_firm_id_top = features.firm_id.value_counts()[:20].index
features_firm_id_top_lim = features.loc[features.firm_id.isin(features_firm_id_top)]

#check there are only 50 unique values
features_firm_id_top_lim.firm_id.nunique()
Out[38]:
20
In [39]:
plt.figure(figsize = [20, 15])
x = features_firm_id_top_lim
generate_pct_bar(x, 'firm_id', 'viridis_r')
# n, bins, patches = plt.hist(x, num_bins, facecolor = 'pink', alpha = 0.5)
plt.xlabel('Firm ID', size = 20)
plt.ylabel('Count', size = 20)
plt.title('Top 20 Most Common Firm IDs under "Features"', size = 25)

plt.savefig(PATH_UNIVARIATE + 'features-firm-id-count-top.png');
plt.show();
In [40]:
features_firm_id_bottom = features.firm_id.value_counts()[-20:].index
features_firm_id_bottom_lim = features.loc[features.firm_id.isin(features_firm_id_top)]

#check there are only 50 unique values
features_firm_id_bottom_lim.firm_id.nunique()
Out[40]:
20
In [41]:
plt.figure(figsize = [20, 15])
x = features_firm_id_bottom_lim
generate_pct_bar(x, 'firm_id', 'viridis_r')
# n, bins, patches = plt.hist(x, num_bins, facecolor = 'pink', alpha = 0.5)
plt.xlabel('Firm ID', size = 20)
plt.ylabel('Count', size = 20)
plt.title('20 Least Common Firm IDs under "Features"', size = 25)

plt.savefig(PATH_UNIVARIATE + 'features-firm-id-count-bottom.png');
plt.show();

Observation 1: Both the 20 most common and least common Firm IDs all make up the same proportion of existing Firm IDs: 5.0% each. This means that under "Features", there is a consistent count among all Firm IDs at around 335.

In [42]:
#check count consistency among firm_ids
firm_counts = features.firm_id.value_counts()
np.unique(firm_counts.sort_values().values)
Out[42]:
array([332], dtype=int64)

I discovered that all firm id counts are consistent across the entire features.csv dataset at 332 entries per firm id. There are no null firm ids.

FEATURE (Features)

In [43]:
import itertools
plt.figure(figsize = [13, 8])

#set palette
base_color = sb.color_palette()[5]

#countplot
plt.subplot(1, 2, 1)
sb.countplot(data = features, x = 'feature', order = features.feature.value_counts(ascending = True).index,
            color = base_color)
frame = plt.gca()

#pie chart
plt.subplot(1, 2, 2)
sorted_counts = features['feature'].value_counts()
plt.pie(features.feature.value_counts(), startangle = 90, counterclock = False,
        autopct='%1.2f%%', labels = features.feature.value_counts().index);
plt.axis('square');

#overall graphic
plt.suptitle('All Features by Count and Percentage under "Features"', size = 20)
plt.tight_layout()
plt.subplots_adjust(top = 0.9)
plt.savefig(PATH_UNIVARIATE + 'features-feature-pct-count.png')

Observation 1: eps_act, eps_fc, and eod_act all show consistent counts at around 42500 entries each (25.30% each).

Observation 2: eps_fc_terms is the only feature type to deviate from the others, having less entries at around 41,000 (24.10%).

Observation 3: It makes sense that eps_fc_terms contains missing data, because the year 1999 was not included while gathering this data. (effectively removing 2020 entries).

DATE (Features)

In [44]:
#double check that eps_fc and eps_act are the only features to have null Date entries
features[features.date.isna()].feature.unique()
Out[44]:
array(['eps_fc', 'eps_act'], dtype=object)

Since we acknowledged that the date column is set to NaT for eps_fc and eps_act, we will write all our interpretations accordingly.

All below graphs under "Date" do not take into account the features eps_fc and eps_act.

In [45]:
features_years = features.date.dt.year.dropna().astype(int)
features_months = features.date.dt.month.dropna().astype(int)
In [46]:
#years
plt.figure(figsize = [15, 5])

ax = sb.countplot(x = features_years, color = sb.color_palette()[0])
ax.set(xlabel = 'Year', ylabel = 'Count')
ax.set_title('Frequency of Years under "Features"', size = 20)

plt.rcParams['axes.labelsize'] = 15
plt.savefig(PATH_UNIVARIATE + 'features-date-years-count.png')
plt.show()

Observation 1: All years between 2000 - 2018 have a consistent count at around 4,000 for all firms.

  • This makes sense, because the year 1999 is missing from all eps_fc_terms entries.

Observation 2: The years 1999 and 2019 are both inconsistent and less than the number of 4,000 counts for all other Date years.

Observation 3: The year 1999 has 2500 non-null entries.

Observation 4: The year 2019 has 3500 non-null entries.

In [47]:
#months
features_months = features_months.apply(lambda x: calendar.month_abbr[x])
months_order = ['Jan', 'Mar', 'Apr', 'Jun', 'Jul', 'Sep', 'Oct', 'Dec']
In [48]:
plt.figure(figsize = [10, 5])

ax = sb.countplot(data = features, x = features_months, color = sb.color_palette()[0], order = months_order)
ax.set(xlabel = 'Month', ylabel = 'Count')
ax.set_title('Frequency of Months under "Features"', size = 20)
sb.despine(offset = 10, trim = False)

plt.rcParams['axes.labelsize'] = 15
plt.savefig(PATH_UNIVARIATE + 'features-date-months-count.png')
plt.show();

Observation 1: September has the most number of counts at around 15,000.

Observation 2: October has the least number of counts at around 9,000.

Observation 3: The trend in counts of months under "Date" fluctuates. It is not a linear or exponential pattern; it seems that there is a "peak" in counts every 2nd recorded month from January.

  • For example, March peaks at around 11,000 counts, then June peaks at around 10,500 counts, and Septemberat around 15,000.

It is safe to conclude that the "Date" column is unreliable when examining terms and years, and should be avoided. It is better to use the "Date" column only when referring to specific dates.

TERM (Features)

As noted earlier, there are no missing values under the Term column. Therefore, all graphs below do account for eps_act and eps_fc.

In [49]:
#years
plt.figure(figsize = [20, 10])

ax = sb.countplot(data = features, x = features.term.dt.year, color = sb.color_palette()[0])
ax.set(xlabel = 'Year', ylabel = 'Count')
ax.set_title('Frequency of Term Years under "Features"', size = 20)

plt.rcParams['axes.labelsize'] = 15
plt.savefig(PATH_UNIVARIATE + 'features-term-years-count.png')
plt.show();

Observation 1: All years from 2000 to 2019 have consistent counts at 8,000 per year.

Observation 2: Unlike the years under Date, 1999 is the only year that does not follow the general trend. There are 6000 recorded entries containing 1999. This means that 2,000 entries do not contain the year 1999.


In [50]:
#quarter
plt.figure(figsize = [10, 5])

ax = sb.countplot(data = features, x = features.term.dt.quarter, color = sb.color_palette()[0])
ax.set(xlabel = 'Quarter', ylabel = 'Count')
ax.set_title('Frequency of Term Quarters under "Features"', size = 20)

plt.rcParams['axes.labelsize'] = 15
plt.savefig(PATH_UNIVARIATE + 'features-term-quarters-count.png')
plt.show();

Observation 1: There is a consistent number of quarters under term, unlike years.

Observation 2: This means that quarters is a more stable, reliable variable to examine under Term unlike years, which should be examined more closely in regard to which feature(s) contains that year.

VALUE (Features)

In [51]:
def generate_hist(df, x, bins, title, xlabel, ylabel, save_path, csv_name):
    plt.figure(figsize = [14, 7])
    
    plt.hist(data = df, x = x, bins = bins, color = 'palevioletred')
    plt.title(title, size = 25)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    
    plt.savefig(save_path + csv_name)
In [52]:
value_bins = np.arange(features.value.min(), features.value.max() + 100, 100)
generate_hist(features, 'value', value_bins, 'Distribution of All Values under Features', 
                'Value (eps_fc, eps_act, eod_act, OR eps_fc_terms)',
                'Count', PATH_UNIVARIATE, 'features-value-hist.png')

Observation 1: The trend in value counts is heavily right-skewed

Observation 2: The value range 0-100 contains the highest concentration of data, with over 120,000 entries.

Observation 3: The value range 0-300 contains the "bulk" of all the data, which means the surrounding x-axis values are all outliers.


Questions:

4) How do value counts under Features look like after removing all outliers around the value range 0-300?

In [53]:
value_bins = np.arange(-5, 200 + 10, 10)
value_hist = generate_hist(features, 'value', value_bins, 'Distribution of All Values under Features (without outliers)', 
                'Value (eps_fc, eps_act, eod_act, OR eps_fc_terms)',
                'Count', PATH_UNIVARIATE, 'features-value-hist-zoom-1.png')

Question 5: How do value counts under Features look like after removing all outliers around the value range 0-300?

Observation 1: The graph appears to be a normal distribution with a strong right skew. This is still unlike the graph with all the outliers, where the right skew is much more pronounced and prominent.

Observation 2: Instead of breaking the bins up by bin widths of 100, the bin width here is 10.

Observation 3: The value range 0-10 contains the bulk of all the data. This is the highest bar, with counts of around 12,000.

  • The previous graph showed that the bin range 0-100 contains around 120,000 points.
  • This graph shows that the bin range 0-10 contains around 120,000 points.
  • Drawing from the previous 2 observations, we can conclude that it's really the bin range 0-10 that contains the bulk of all the data.

It would be a good idea to break down the 0-10 bin range even further, and examine solely that range.

In [54]:
value_bins = np.arange(0, 10 + 0.05, 0.05)
value_hist = generate_hist(features, 'value', value_bins, 'Distribution of All Values under Features (range 0-10)', 
                'Value (eps_fc, eps_act, eod_act, OR eps_fc_terms)',
                'Count', PATH_UNIVARIATE, 'features-value-hist-zoom-2.png')

Observation 1: The graph, even when zoomed in, still retains a strong right skew.

Observation 2: The "bulk" of the data is around the value range 0.02 to 0.05: the "peak" of the distribution.

Just to make sure this graph is skewed heavily to the right, let's create a kernel density curve:

In [55]:
def generate_distplot(data, bins):
    fig = plt.figure(figsize = [14, 7])
    ax = sb.distplot(data, bins = bins, color = 'hotpink')
    ax.minorticks_on()
    return fig, ax
In [56]:
value_bins = np.arange(features.value.min(), features.value.max() + 20, 20)
generate_distplot(features.value.dropna(), bins = value_bins)
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Distribution of all Values under "Features"', size = 25);
plt.savefig(PATH_UNIVARIATE + 'features-value-dist.png')

Observation: As shown by the kernel density curve, the distribution of values under Features is heavily right-skewed. This means that most values are clustered around the left-side of the distribution where the mean, median, and mode are all located.

FIRM_ID (Averages)

In [57]:
avgs_firm_id_top = avgs.firm_id.value_counts()[:20].index
avgs_firm_id_top_lim = avgs.loc[avgs.firm_id.isin(avgs_firm_id_top)]

#check there are only 20 unique values
avgs_firm_id_top_lim.firm_id.nunique()
Out[57]:
20
In [58]:
plt.figure(figsize = [15, 15])
x = avgs_firm_id_top_lim
num_bins = 5
generate_pct_bar(x, 'firm_id', 'viridis_r')
# n, bins, patches = plt.hist(x, num_bins, facecolor = 'pink', alpha = 0.5)
plt.xlabel('Firm ID')
plt.ylabel('Count')
plt.title('Top 20 Most Common Firm IDs under "Averages"', size = 20)

plt.savefig(PATH_UNIVARIATE + 'avgs-firm-id-count-top.png');
plt.show();
In [59]:
avgs_firm_id_bottom = avgs.firm_id.value_counts()[-20:].index
avgs_firm_id_bottom_lim = avgs.loc[avgs.firm_id.isin(avgs_firm_id_top)]

#check there are only 20 unique values
avgs_firm_id_bottom_lim.firm_id.nunique()
Out[59]:
20
In [60]:
plt.figure(figsize = [20, 15])
x = avgs_firm_id_bottom_lim
generate_pct_bar(x, 'firm_id', 'viridis_r')
# n, bins, patches = plt.hist(x, num_bins, facecolor = 'pink', alpha = 0.5)
plt.xlabel('Firm ID', size = 20)
plt.ylabel('Count', size = 20)
plt.title('20 Least Common Firm IDs under "Averages"', size = 25)

plt.savefig(PATH_UNIVARIATE + 'avgs-firm-id-count-bottom.png');
plt.show();

Observation 1: Both the 20 most common and least common Firm IDs all make up the same proportion of existing Firm IDs: 5.0% each. This means that under "Averages", there is a consistent count among all Firm IDs at around 105.

In [61]:
#check consistency of counts among firm_ids
firm_counts = avgs.firm_id.value_counts()
np.unique(firm_counts.sort_values().values)
Out[61]:
array([103], dtype=int64)

I discovered that all firm id counts are consistent across the entire avgs.csv dataset at 105 entries per firm id. There are no null firm ids.

AVERAGE (Averages)

In [62]:
value_bins = np.arange(avgs.average.min(), avgs.average.max() + 100, 100)
generate_hist(avgs, 'average', value_bins, 'Distribution of All Averages', 
                'Average(twenty-year, quarterly, yearly)',
                'Count', PATH_UNIVARIATE, 'avgs-avg-hist.png')

Observation 1: The trend in value counts is heavily right-skewed

Observation 2: The value range 0-100 contains the highest concentration of data, with around 36,000 entries.

Observation 3: The value range 0-300 contains the "bulk" of all the data, which means the surrounding x-axis values are all outliers.

Observation 4: There is a all values, whether under Features or Averages, share a common theme of taking up the "bulk" of data in the same bin range


Questions:

6) How do value counts under Averages look like after removing all outliers around the value range 0-300?

In [63]:
value_bins = np.arange(-5, 200 + 10, 10)
generate_hist(avgs, 'average', value_bins, 'Distribution of All Averages (without Outliers)', 
                'Average(twenty-year, quarterly, yearly)',
                'Count', PATH_UNIVARIATE, 'avgs-avg-hist-zoom-1.png')

Question 6: How do average counts under Averages look like after removing all outliers around the value range 0-300?

Observation 1: The graph appears to be a normal distribution with a strong right skew. This is still unlike the previous graph with all the outliers, where the right skew is much more pronounced and prominent.

Observation 2: Instead of breaking the bins up by bin widths of 100, the bin width here is 10.

Observation 3: The value range 0-10 contains the bulk of all the data. This is the highest bar, with counts of around 12,000.

  • The previous graph showed that the bin range 0-100 contains around 36,000 points.
  • This graph shows that the bin range 0-10 contains around 34,000 points.
  • Drawing from the previous 2 observations, we can conclude that it's really the bin range 0-10 that contains the bulk of all the data, with the other ~2,000 points being scattered among the bins from 10 to 110.

Observation 4: Among the bins from 10 to 110 that contain the other 2,000 counts, the bin 30-40 contains the "bulk" of that data at around 2,000 counts.

It would be a good idea to break down the 0-10 bin range even further for Averages, and examine solely that range.

In [64]:
value_bins = np.arange(0, 10 + 0.05, 0.05)
value_hist = generate_hist(avgs, 'average', value_bins, 'Distribution of All Averages (range 0-10)', 
                'Average',
                'Count', PATH_UNIVARIATE, 'avgs-avgs-hist-zoom-2.png')

Observation 1: The graph, even when zoomed in, still retains a strong right skew.

Observation 2: The "bulk" of the data is around the value range 0.02 to 0.05: the "peak" of the distribution.

Just to make sure this graph is skewed heavily to the right, let's create a kernel density curve:

In [65]:
value_bins = np.arange(avgs.average.min(), avgs.average.max() + 20, 20)
generate_distplot(avgs.average.dropna(), bins = value_bins)
plt.xlabel('Average')
plt.ylabel('Density')
plt.title('Distribution of all Averages under "Averages"', size = 25);
plt.savefig(PATH_UNIVARIATE + 'avgs-average-dist.png')

Observation: As shown by the kernel density curve, the distribution of average under Averages is heavily right-skewed. This means that most averages are clustered around the left-side of the distribution where the mean, median, and mode are all located.

A KEY TAKEAWAY: There is a strong congruency between the distribution patterns of stock prices (values) under features.csv and average prices (averages) under Averages.

AVERAGE TYPE (Averages)

In [66]:
plt.figure(figsize = [10, 5])
cat_order = avgs.average_type.value_counts().index
sb.countplot(data = avgs, x = 'average_type', color = 'hotpink', order = cat_order)
plt.xlabel('Average Type')
plt.ylabel('Count')
plt.title('Average Type by Count', size = 20)
plt.savefig(PATH_UNIVARIATE + 'avgs-avgtype-count.png')

Observation 1: Under Averages, yearly data has the highest number of counts while twenty-year data has the least number of counts. This all makes sense, as there are more yearly averages than quarters per firm, and more quarters than twenty-year entries per firm.

  • For each firm, there are around 20 yearly averages (1999 to 2019) for each of the 4 features. (80 total per firm).
  • For each firm, there are 4 quarterly averages for each of the 4 features. (16 total per firm)
  • For each firm, there is 1 twenty-year average for each of the 4 features. (4 total per firm)

Note: number of entries may vary for yearly averages because the year 1999 and 2019 are missing for some features.

In [67]:
#keep for analysis, delete when done w/ report
len(avgs.query('average_type == "yearly" and firm_id == 0'))
Out[67]:
83
In [68]:
colors = random.choices(list(mcolors.CSS4_COLORS.values()), k = 3)

plt.figure(figsize = [10, 5])
cs=cm.Set1(np.arange(40)/40.)

plt.pie(avgs.average_type.value_counts(), startangle = 90, counterclock = False,
        autopct='%1.2f%%', labels = avgs.average_type.value_counts().index, colors = colors);
plt.suptitle('Average Type by Percentage', size = 20)
plt.tight_layout()
plt.subplots_adjust(top = 0.9)
plt.axis('square');
plt.savefig(PATH_UNIVARIATE + 'avgs-avgtype-pie.png')

Observation 1: As expected, yearly average types make up the majority of all entries at 80.58%.

Observation 2: As expected, twenty-year average types make up the least portion of all entries at 3.88%.

Observation 3: These percentages are consistent with the previous counts.

TIME_PERIOD (Averages)

In [69]:
plt.figure(figsize = [15, 15])
cat_order = avgs.time_period.value_counts().sort_index().index
sb.countplot(data = avgs, x = 'time_period', color = 'hotpink', order = cat_order)
plt.xlabel('Time Period (yearly or quarterly)')
plt.ylabel('Count')
plt.title('Yearly & Quarterly Time Periods by Count (Averages)', size = 20)
plt.savefig(PATH_UNIVARIATE + 'avgs-time-period-hist.png')

I decided to portray both yearly and quarterly Time Periods in the same countplot because q1-q4 and the years 2000-2019 show consistent counts.

Observation 1: All years from 2000-2019 and all quarters show consistent counts, at just over 2,000 entries each.

Observation 2: The year 1999 has the least amount of counts, at only around 1510. This is because the feature eps_fc_terms does not contain the year 1999.

Observation 3: The time period for twenty-year averages does not show because by default, all Time Period entries associated with average type of twenty-year is set to null.

FEATURE (Averages)

In [70]:
plt.figure(figsize = [14, 7])

#set palette
base_color = sb.color_palette()[5]

#countplot
plt.subplot(1, 2, 1)
cat_order = avgs.feature.value_counts().sort_index(ascending = False).index
# generate_pct_bar(avgs, 'feature', 'Blues_d')
sb.countplot(data = avgs, x = avgs.feature, color = 'hotpink', order = cat_order)

#pie chart
plt.subplot(1, 2, 2)
sorted_counts = avgs['feature'].value_counts()
plt.pie(avgs.feature.value_counts(), startangle = 90, counterclock = False,
        autopct='%1.2f%%', labels = avgs.feature.value_counts().index);
cat_order = avgs.feature.value_counts().sort_index(ascending = False).index
plt.axis('square');

#overall graphic
plt.suptitle('All Features by Count and Percentage under "Averages"', size = 20)
# plt.tight_layout()
plt.subplots_adjust(top = 0.9)
plt.savefig(PATH_UNIVARIATE + 'avgs-feature-hist-pie.png')

Observation 1: eps_act, eps_fc, and eod_act all show consistent counts at around 13,000 entries each (25.24% each).

Observation 2: eps_fc_terms is the only feature type to deviate from the others, having less entries at around 12,500 (24.27%).

Observation 3: It makes sense that eps_fc_terms contains missing data, because the year 1999 was not included while gathering this data. This effectively removes around 500 entries: not a substantial amount of data.

IV) Bivariate Exploration

General rule: The closer the eps_act - eps_fc difference is to 0, the more accurate the forecast.

VALUE by FEATURE

In [71]:
temp = features[features.feature.isin(['eps_fc', 'eps_act'])]
In [72]:
plt.figure(figsize = [10, 10])
sb.stripplot(x = 'feature', y = 'value', data = features, jitter = True)

plt.title('Stock Price Types and ', size = 20)
plt.xlabel('Stock Price Type')
plt.ylabel('Value')

plt.xticks([0, 1, 2, 3], ['EOD', 'Forecasted EPS (3 Months Prior)', 'Forecasted EPS', 'Actual EPS'])
plt.savefig(PATH_BIVARIATE + 'features-feature-value.png')
plt.show();

Observation 1: EOD Price has the most variability among the other stock price types.

Observation 2: Forecasted EPS (3 months prior) and Forecasted EPS (current) are the only stock price types to appear uniforn in distribution: their values are clustered around 0.

Observation 3: Actual EPS has the most outliers compared to the other stock price types: 5 outliers. And since actual EPS has the most outliers, it turns out that forecasters' predictions have been too uniform and stable in the approach of analyzing stock market trends.


EOD Price by Term (features.csv)

In [73]:
features_eod = features[features.feature == 'eod_act']

#create 5-year moving averages to "smooth" out all future graphs
features_eod['ma_value'] = features_eod.value.rolling(12).mean()
D:\Anaconda3\lib\site-packages\ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
In [74]:
plt.figure(figsize = [20, 10])
sb.pointplot(x = 'term', y = 'value', data = features_eod, errwidth = 0.5)
plt.xticks(rotation = 'vertical')

plt.title('Average EOD Price by Term Across All Firms', size = 20)
plt.xlabel('Term')
plt.ylabel('Average EOD Price')

plt.savefig(PATH_BIVARIATE + 'features-eod-term.png')

Observation 1: The graph depicts a positive linear trend from all quarters spanning 1999 - 2019.

Observation 2: The average EOD Price troughs during 2009 Quarter 1. The start of the trend in this downfall starts in 2007 Quarter 4, but the average EOD price recovers shortly after, only 1 quarter later on 2009, Quarter 2.

In layman's terms, this generally positive linear trend shows that all 505 firms in the S&P 2019 Index have been getting "richer" over the past 20 years.

In [75]:
plt.figure(figsize = [20, 10])
sb.pointplot(x = 'term', y = 'ma_value', data = features_eod, errwidth = 0.5)
plt.xticks(rotation = 'vertical')

plt.suptitle('Average EOD Price by Term Across All Firms', size = 20, y = .93)
plt.title('in 3-year Moving Averages from 1999 - 2019', size = 15)
plt.xlabel('Term')
plt.ylabel('Average EOD Price')

plt.savefig(PATH_BIVARIATE + 'features-eod-term-ma.png')

The EOD graph could be used to make correlations with EPS forecasts and how the firm is getting "richer" overtime.

DIFFERENCE IN FORECASTED AND ACTUAL EPS (PREDICTION ERROR) BY TERM

In [76]:
#isolate eps_fc and eps_act in their own columns to get their difference
features_diffs = features[features.feature.isin(['eps_fc', 'eps_act'])][['firm_id', 'term', 'value', 'feature']]
In [77]:
#isolate eps_fc and eps_act
eps_fc_diffs = features_diffs[features_diffs.feature == 'eps_fc'].rename(columns = {'value' : 'eps_fc_value'})[['firm_id', 'term', 'eps_fc_value']]
eps_act_diffs = features_diffs[features_diffs.feature == 'eps_act'].rename(columns = {'value' : 'eps_act_value'})[['firm_id', 'term', 'eps_act_value']]
In [78]:
#merge, creating separate columns for eps_fc and eps_act values
act_fc_diffs = eps_fc_diffs.merge(eps_act_diffs, how = 'inner', left_on = ['firm_id', 'term'], right_on = ['firm_id', 'term'])

act_fc_diffs.sample(5)
Out[78]:
firm_id term eps_fc_value eps_act_value
15764 187 2013Q1 0.607 0.50
34661 412 2012Q2 0.853 1.08
27829 331 2005Q2 0.710 0.43
30264 360 2005Q1 0.283 0.30
36186 430 2015Q3 0.489 NaN
In [79]:
#create 'difference' column
act_fc_diffs['difference'] = act_fc_diffs['eps_act_value'] - act_fc_diffs['eps_fc_value']
In [80]:
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = 'term', y = 'difference', color = 'purple', errwidth = .5)

plt.suptitle('Average Prediction Error of EPS Values by Term', size = 20, y = .93)
plt.title('across All Firms', size = 19)
plt.xticks(rotation = 'vertical')
plt.xlabel('Term (Year and Quarter)')
plt.ylabel('Average Prediction Error')

plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-term.png')
plt.show();

Observation 1: Forecasters were most optimistic in 2008, Quarter 4.

Observation 2: Forecasters were most pessimistic in 2000, Quarter 4.

Observation 3: The trend indicates no linear relationship; it follows in a stable, one-dimensional direction with average prediction errors center around 0.

In [81]:
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = act_fc_diffs.term.dt.year, y = 'difference', color = 'purple', errwidth = .5)

plt.suptitle('Average Prediction Error of EPS Values by Year', size = 20, y = .93)
plt.title('across All Firms', size = 19)
# plt.xticks(rotation = 'vertical')
plt.xlabel('Year')
plt.ylabel('Average Prediction Error')

plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-year.png')
plt.show();

Observation 1: Forecasters were most pessimistic in the year 2000. From the error bar, the year 2000 shows one of the widest variances in average prediction errors ranging from 0 until 0.5

Observation 2: Forecasters were most optimistic in the year 2008. From the error bar, the year 2008 shows one of the widest variances in average prediction errors ranging from -0.25 until -1.5.

Observation 3: The overall trend stays consistent with the previous trends by term: consistent, relatively stable, and containing no slope.

In [82]:
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = act_fc_diffs.term.dt.quarter, y = 'difference', color = 'purple', errwidth = .5)

plt.suptitle('Average Prediction Error of EPS Values by Quarter', size = 20, y = .93)
plt.title('across All Firms', size = 19)
# plt.xticks(rotation = 'vertical')
plt.xlabel('Quarter')
plt.ylabel('Average Prediction Error')

plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-quarter.png')
plt.show();

Observation: The later the quarter, the more optimistic forecasters become in their predictions. I am assuming this is because each new quarter brings the familiarity of the current year's stock market dynamics.

DIFFERENCE IN MOVING AVERAGE FORECASTED AND ACTUAL EPS (PREDICTION ERROR) BY TERM

In [83]:
#3-year moving averages
act_fc_diffs['ma_difference'] = act_fc_diffs.difference.rolling(12).mean()
In [84]:
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = 'term', y = 'ma_difference', color = 'purple', errwidth = .5)

plt.suptitle('Average Prediction Error of EPS Values by Term', size = 20, y = .93)
plt.title('in 3-Year Moving Averages (Actual - Forecasted EPS)', size = 19)
plt.xticks(rotation = 'vertical')
plt.xlabel('Term (Year and Quarter)')
plt.ylabel('Average Prediction Error')

plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-term-ma.png')
plt.show();
In [85]:
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = act_fc_diffs.term.dt.year, y = 'ma_difference', color = 'purple', errwidth = .5)

plt.suptitle('Average Prediction Error of EPS Values by Year', size = 20, y = .93)
plt.title('in 3-Year Moving Averages (Actual - Forecasted EPS)', size = 19)
plt.xlabel('Year')
plt.ylabel('Average Prediction Error')

plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-year-ma.png')
plt.show();
In [86]:
plt.figure(figsize = [20, 15])
sb.pointplot(data = act_fc_diffs, x = act_fc_diffs.term.dt.quarter, y = 'ma_difference', color = 'purple', errwidth = .5)

plt.suptitle('Average Prediction Error of EPS Values by Quarter', size = 20, y = .93)
plt.title('in 3-Year Moving Averages (Actual - Forecasted EPS)', size = 19)
plt.xlabel('Quarter')
plt.ylabel('Average Prediction Error')

plt.savefig(PATH_BIVARIATE + 'features-act-fc-diffs-quarter-ma.png')
plt.show();

Observation 1: Compared to their non-moving-average counterparts, the above pointplots plotting 3-year moving averages show a consistent slopeless trned, including periods of optimism vs. pessimism.

DIFFERENCE IN TWENTY YEAR AVERAGE OF EPS_FC AND EPS_ACT BY FIRM (Averages)

In [87]:
#array of all firm ids
ids = firm_ids.firm_id.values
In [88]:
#put eps_fc and eps_act averages into separate DFs
avgs_eps_fc = avgs[avgs['feature'] == 'eps_fc']
avgs_eps_act = avgs[avgs['feature'] == 'eps_act']
In [89]:
#isolate eps_fc and eps_act DFs by "twenty_year" average type
avgs_eps_fc_twenty = avgs_eps_fc[avgs_eps_fc['average_type'] == 'twenty_year']
avgs_eps_act_twenty = avgs_eps_act[avgs_eps_act['average_type'] == 'twenty_year']
In [90]:
#df of only twenty-year average types
avgs_twenty = pd.concat([avgs_eps_fc_twenty, avgs_eps_act_twenty])
In [91]:
#obtain differences in eps_fc and eps_act averages BY firm id
avg_eps_fc_twenty = avgs_twenty.query('feature == "eps_fc"').average.values
avg_eps_act_twenty = avgs_twenty.query('feature == "eps_act"').average.values
twenty_diff_data = list(zip(ids, avg_eps_fc_twenty, avg_eps_act_twenty))

twenty_diffs = pd.DataFrame(twenty_diff_data, columns = ['firm_id', 'average_eps_fc', 'average_eps_act'])
twenty_diffs['difference'] = twenty_diffs['average_eps_act'] - twenty_diffs['average_eps_fc']
twenty_diffs['average_type'] = 'twenty_year'
In [92]:
twenty_diffs.sample(5)
Out[92]:
firm_id average_eps_fc average_eps_act difference average_type
422 422 0.856357 0.880014 0.023657 twenty_year
488 488 0.233048 0.113171 -0.119877 twenty_year
153 153 0.893179 0.218497 -0.674681 twenty_year
104 104 0.288733 0.269903 -0.018831 twenty_year
215 215 1.796500 1.771177 -0.025323 twenty_year
In [93]:
plt.figure(figsize = [20, 20])
plt.title('Prediction Error among Twenty-Year Averages by Firm ID\n(Actual EPS - Forecasted EPS)', size = 25)
plt.plot(twenty_diffs['firm_id'], twenty_diffs['difference'])

plt.xlabel('Firm ID', size = 15)
plt.ylabel('Prediction Error', size = 15)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-twenty-year.png')
plt.show();

DIFFERENCE IN QUARTERLY AVERAGE OF EPS_FC AND EPS_ACT (Averages)

In [94]:
#isolate eps_fc and eps_act DFs by "quarterly" average type
avgs_eps_fc_quarter = avgs_eps_fc[avgs_eps_fc['average_type'] == 'quarterly']
avgs_eps_act_quarter = avgs_eps_act[avgs_eps_act['average_type'] == 'quarterly']
In [95]:
#df of only quarterly averages
avgs_quarter = pd.concat([avgs_eps_fc_quarter, avgs_eps_act_quarter])
In [96]:
#get the average average separately per firm
avgs_quarter_sep = avgs_quarter.groupby(['firm_id', 'feature']).mean().reset_index()
avgs_quarter_sep.shape[0]
Out[96]:
1010
In [97]:
#obtain differences in eps_fc and eps_act averages BY firm id
avg_eps_fc_quarter = avgs_quarter_sep.query('feature == "eps_fc"').average.values
avg_eps_act_quarter = avgs_quarter_sep.query('feature == "eps_act"').average.values

quarter_diff_data = list(zip(ids, avg_eps_fc_quarter, avg_eps_act_quarter))

quarter_diffs = pd.DataFrame(quarter_diff_data, columns = ['firm_id', 'average_eps_fc', 'average_eps_act'])
quarter_diffs['difference'] = quarter_diffs['average_eps_act'] - quarter_diffs['average_eps_fc']
quarter_diffs['average_type'] = 'quarter'
In [98]:
quarter_diffs.head()
Out[98]:
firm_id average_eps_fc average_eps_act difference average_type
0 0 0.530992 0.350250 -0.180742 quarter
1 1 0.338961 -0.394274 -0.733235 quarter
2 2 1.054504 0.986728 -0.067777 quarter
3 3 0.873143 0.944501 0.071358 quarter
4 4 1.319607 0.762376 -0.557231 quarter
In [99]:
plt.figure(figsize = [20, 15])
plt.xlabel('Firm ID', size = 15)
plt.ylabel('Prediction Error', size = 15)
plt.title('Prediction Error among Quarterly Averages by Firm ID\n(Actual EPS - Forecasted EPS)', size = 25)
plt.plot(quarter_diffs['firm_id'], quarter_diffs['difference'])
# plt.scatter(twenty_diffs['firm_id'], twenty_diffs['difference'])
plt.savefig(PATH_BIVARIATE + 'avgs-diff-quarter.png')
plt.show();

DIFFERENCE IN YEARLY AVERAGE OF EPS_FC AND EPS_ACT (Averages)

In [100]:
#isolate eps_fc and eps_act DFs by "yearly" average type
avgs_eps_fc_year = avgs_eps_fc[avgs_eps_fc['average_type'] == 'yearly']
avgs_eps_act_year = avgs_eps_act[avgs_eps_act['average_type'] == 'yearly']
In [101]:
#df of only yearly averages
avgs_year = pd.concat([avgs_eps_fc_year, avgs_eps_act_year])
In [102]:
#get the average average separately per firm
avgs_year_sep = avgs_year.groupby(['firm_id', 'feature']).mean().reset_index()
avgs_year_sep.sample(5)
Out[102]:
firm_id feature average
385 192 eps_fc 1.447800
454 227 eps_act 0.661924
330 165 eps_act -0.028917
20 10 eps_act 0.478036
877 438 eps_fc 0.576583
In [103]:
#obtain differences in eps_fc and eps_act averages BY firm id
avg_eps_fc_year = avgs_year_sep.query('feature == "eps_fc"').average.values
avg_eps_act_year = avgs_year_sep.query('feature == "eps_act"').average.values

year_diff_data = list(zip(ids, avg_eps_fc_year, avg_eps_act_year))

year_diffs = pd.DataFrame(year_diff_data, columns = ['firm_id', 'average_eps_fc', 'average_eps_act'])
year_diffs['difference'] = year_diffs['average_eps_act'] - year_diffs['average_eps_fc']
year_diffs['average_type'] = 'year'
In [104]:
plt.figure(figsize = [20, 15])
plt.xlabel('Firm ID', size = 15)
plt.ylabel('Prediction Error', size = 15)
plt.title('Prediction Error among Yearly Averages by Firm ID\n(Actual EPS - Forecasted EPS)', size = 25)
# plt.title('(Average Forecasted EPS - Average Actual EPS)')
# plt.rcParams = plt.rcParamsDefault
plt.plot(year_diffs['firm_id'], year_diffs['difference'])
# plt.scatter(twenty_diffs['firm_id'], twenty_diffs['difference'])
plt.savefig(PATH_BIVARIATE + 'avgs-diff-year.png')
plt.show();
  • Put all the above graphs into a single FacetGrid for easy accessibility and comparing, in the Multivariate Exploration stage.
In [105]:
#put all yearly, quarterly, and twenty-year differences into one DF
all_diffs = pd.concat([twenty_diffs, quarter_diffs, year_diffs], sort = False)
all_diffs.sample(5)
Out[105]:
firm_id average_eps_fc average_eps_act difference average_type
349 349 0.818321 0.822310 0.003988 year
97 97 0.529571 0.571867 0.042296 twenty_year
232 232 0.405600 0.263333 -0.142267 year
242 242 0.342000 0.370972 0.028972 year
407 407 -0.035661 -0.289988 -0.254327 twenty_year
  • Now that we looked at the raw data, let's create a line plot depicting the average difference among all average types.
In [106]:
plt.figure(figsize = [20, 15])
plt.xlabel('Firm ID', size = 15)
plt.ylabel('Prediction Error', size = 15)
plt.title('Differences in Yearly Averages by Firm ID\n(Actual EPS - Forecasted EPS)', size = 25)
# plt.title('(Average Forecasted EPS - Average Actual EPS)')
# plt.rcParams = plt.rcParamsDefault
plt.plot(year_diffs['firm_id'], year_diffs['difference'])
# plt.scatter(twenty_diffs['firm_id'], twenty_diffs['difference'])
plt.savefig(PATH_BIVARIATE + 'avgs-diff-year.png')
plt.show();
In [107]:
plt.figure(figsize = [20, 15])
plt.xlabel('Firm ID', size = 15)
plt.ylabel('Prediction Error', size = 15)
plt.title('Prediction Error among All Averages by Firm ID\n(Twenty-Year, Yearly, and Quarterly)', size = 25)
plt.plot(all_diffs['firm_id'], all_diffs['difference'])
# plt.scatter(twenty_diffs['firm_id'], twenty_diffs['difference'])
plt.savefig(PATH_BIVARIATE + 'avgs-diff-all.png')
plt.show();

Observation: Since all graphs depicting the average differences of forecasted and actual EPS, it is safe to use the all_diffs DF to answer for the yearly, quarterly, and twenty-year interpretations.

Approach: Isolate the top 20 firms with the largest absolute differences (most inaccurate) and the top 20 firms with the smallest absolute differences (most accurate)

In [108]:
#helper function to convert a DF for firm_ids to their tickers
def convert_ids_to_tickers(series):
    series = series.apply(lambda x: firm_ids[firm_ids.firm_id == x].values[0][1]).values
    return [x.upper() for x in series]
In [109]:
#add column for absolute difference
all_diffs['difference_abs'] = all_diffs['difference'].abs()
In [110]:
#create column referring to firm ticker
all_diffs['firm_ticker'] = convert_ids_to_tickers(all_diffs.firm_id)
In [111]:
#rename column and values to fit legend
all_diffs.average_type = all_diffs.average_type.str.replace('_', ' ').str.capitalize()
all_diffs = all_diffs.rename(columns = {'average_type' : 'Average Type'})

Twenty-Year

In [112]:
#isolate twenty-year average type
twenty_diffs = all_diffs[all_diffs['Average Type'] == 'Twenty year']
In [113]:
#reorder firms by absolute difference
twenty_diffs_top = twenty_diffs.sort_values(by='difference_abs', ascending = False)
twenty_diffs_bottom = twenty_diffs.sort_values(by='difference_abs', ascending = True)
In [114]:
#drop rows with duplicate firm ids, get top 20 firm ids
twenty_diffs_top = twenty_diffs_top.drop_duplicates(subset = 'firm_id', keep = 'first').head(20)
twenty_diffs_bottom = twenty_diffs_bottom.drop_duplicates(subset = 'firm_id', keep = 'first').head(20)
In [115]:
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = twenty_diffs_top.firm_ticker, y = twenty_diffs_top.difference)

plt.suptitle('Top 20 Most Inaccurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('as Twenty-Year Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-twenty-year-top.png')

The 9 most inaccurately predicted firm tickers are AIG, LRCX, GM, BKNG, CHTR, AGN, GOOGL, and NVR.

In [116]:
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = twenty_diffs_bottom.firm_ticker, y = twenty_diffs_bottom.difference)

plt.suptitle('Top 20 Most Accurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('as Twenty-Year Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-twenty-year-bottom.png')

At first, the trend appears unstable, and this is a tricky and wrong assumption. Visualizing that absolute value of each point proves that this trend among the top 20 most accurately predicted firms is a steady, linear pattern.

Quarterly

In [117]:
#isolate twenty-year average type
quarter_diffs = all_diffs[all_diffs['Average Type'] == 'Quarter']
In [118]:
#create temp DF to get absolute mean difference of all firms
quarter_diffs_means = quarter_diffs.groupby('firm_ticker').mean().reset_index()
In [119]:
#get firm tickers of top/bottom 20 firms
quarter_diffs_top_ids = quarter_diffs_means.sort_values(by = 'difference_abs', ascending = False).head(20).firm_ticker.values
quarter_diffs_bottom_ids = quarter_diffs_means.sort_values(by = 'difference_abs', ascending = True).head(20).firm_ticker.values
In [120]:
#filter quarterly DF for top/bottom firm tickers
quarter_diffs_top = quarter_diffs[quarter_diffs.firm_ticker.isin(quarter_diffs_top_ids)]
quarter_diffs_bottom = quarter_diffs[quarter_diffs.firm_ticker.isin(quarter_diffs_bottom_ids)]
In [121]:
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = quarter_diffs_top.firm_ticker, y = quarter_diffs_top.difference, order = quarter_diffs_top_ids)

plt.suptitle('Top 20 Most Inaccurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('as Quarterly Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-quarter-top.png')

The 9 most inaccurately predicted firms by quarterly average prediction error are AIG, LRCX, GM, BKNG, CHTR, AGN, GOOGL, NVR, and AZO.

In [122]:
#plot firm IDs as CATEGORICAL variables by their respective quarterly differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = quarter_diffs_bottom.firm_ticker, y = quarter_diffs_bottom.difference, order = quarter_diffs_bottom_ids)

plt.suptitle('Top 20 Most Accurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('as Quarterly Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-quarter-bottom.png')

Similar to the previous twenty-year prediction error point plot, this graph shows a general oscillation between negative and positive values. However, this does not matter in the long run, because the absolute distance between 0 and all the above y-values decreases.

Yearly

In [123]:
#isolate yearly average types
year_diffs = all_diffs[all_diffs['Average Type'] == 'Year']
In [124]:
#create temp DF to get absolute mean difference for all firms
year_diffs_means = year_diffs.groupby('firm_ticker').mean().reset_index()
In [125]:
#get firm tickers of top/bottom 20 firms
year_diffs_top_ids = year_diffs_means.sort_values(by = 'difference_abs', ascending = False).head(20).firm_ticker.values
year_diffs_bottom_ids = year_diffs_means.sort_values(by = 'difference_abs', ascending = True).head(20).firm_ticker.values
In [126]:
#filter yearly DF for top/bottom firm tickers
year_diffs_top = year_diffs[year_diffs.firm_ticker.isin(year_diffs_top_ids)]
year_diffs_bottom = year_diffs[year_diffs.firm_ticker.isin(year_diffs_bottom_ids)]
In [127]:
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = year_diffs_top.firm_ticker, y = year_diffs_top.difference, order = year_diffs_top_ids)

plt.suptitle('Top 20 Most Inaccurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('as Yearly Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-year-top.png')

The top 7 most inaccurately predicted firms by yearly average prediction errors are AIG, LRCX, GM, BKNG, CHTR, and NVR.

In [128]:
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = year_diffs_bottom.firm_ticker, y = year_diffs_bottom.difference, order = year_diffs_bottom_ids)

plt.suptitle('Top 20 Most Accurately Predicted Firms', size = 20, y = .93)
plt.title('as Yearly Averages from 1999 - 2019', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-year-bottom.png')

Consistent with previous trends, this graph of the 10 most accurate firms by yearly average predicted errors is varied in its collection of positive and negative values inching closer towards zero.

Twenty-Year, Quarterly, and Yearly

Main question: How do firms differ by forecast inaccuracy by average type (yearly, quarterly, and yearly)?

Steps:

1) Calculate the average yearly, quarterly, and twenty-difference for each firm.

2) Convert the new average differences to their absolute value.

3) Isolate the top 20 and bottom 20 firms by their average absolute distances.

4) Convert Firm IDs to Firm Tickers

5) Use those firm tickers to create a visualization using 2 different datasets:

  • the avgs DF (all raw data)
  • the all_diffs DF (filtered data)
In [129]:
#calculate average difference for all average types
all_diffs_means = all_diffs.groupby('firm_ticker').mean().reset_index()
In [130]:
#get firm tickers of top/bottom 20 firms
all_diffs_top_ids = all_diffs_means.sort_values(by = 'difference_abs', ascending = False).head(20).firm_ticker.values
all_diffs_bottom_ids = all_diffs_means.sort_values(by = 'difference_abs', ascending = True).head(20).firm_ticker.values
In [131]:
#filter DF for top/bottom firm tickers
all_diffs_top = all_diffs[all_diffs.firm_ticker.isin(all_diffs_top_ids)]
all_diffs_bottom = all_diffs[all_diffs.firm_ticker.isin(all_diffs_bottom_ids)]
In [132]:
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = all_diffs_top.firm_ticker, y = all_diffs_top.difference, order = all_diffs_top_ids, errwidth = 0.5)

plt.suptitle('Top 20 Most Inaccurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('across All Average Types', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-all-top.png')

The top 8 most inaccurate firms by all average types are AIG, LRCX, GM, BKNG, CHTR, AGN, NVR, and GOOGL.

In [133]:
#plot firm IDs as CATEGORICAL variables by their respective twenty-year differences
plt.figure(figsize = [20, 15])

sb.pointplot(x = all_diffs_bottom.firm_ticker, y = all_diffs_bottom.difference, order = all_diffs_bottom_ids, errwidth = 0.5)

plt.suptitle('Top 20 Most Accurately Predicted EPS by Firm', size = 20, y = .93)
plt.title('across All Average Types', size = 17)
plt.xlabel('Firm Ticker', size = 17)
plt.ylabel('Average Prediction Error', size = 17)

plt.savefig(PATH_BIVARIATE + 'avgs-diff-all-bottom.png')

The same trend of all values' absolute values steadily approaching 0. Nothing new.

Now, let's look at the top 20 most inaccurate firms, where I isolated the first couple of firms before the general trend started normalizing, approaching to 0.

  • twenty-year average prediction errors: AIG, LRCX, GM, BKNG, CHTR, AGN, GOOGL, and NVR.

  • quarterly average prediction errors: AIG, LRCX, GM, BKNG, CHTR, AGN, GOOGL, NVR, and AZO.

  • yearly average prediction errors: AIG, LRCX, GM, BKNG, CHTR, and NVR.

  • all average prediction errors: AIG, LRCX, GM, BKNG, CHTR, AGN, NVR, and GOOGL.

From the previous summaries, some interesting observations pop up:

Observation 1: AIG is consistently the most inaccurately-predicted firm across all twenty-year, quarterly, and/or yearly average types.

Observation 2: The firms in common—AIG, LRCX, GM, BKNG, CHTR, and NVR—are all ranked in the same order on the x-axis. Across all average types, these firms are consistent in their order.

Observation 3: AGN and GOOGL appears in all the above average types except yearly. This means GOOGL EPS more accurately forecasted on a solely by-year basis.

Observation 4: AZO appears only in the top 20 most inaccurate quarterly average prediction error DataFrame.

Isolate Twenty-Year EPS data by firm

In [134]:
#helper function to separate eps_act and eps_fc averages into their own columns
def separate_eps_fc_act(df, groupby_arr, value_col):
    #combine eps_fc and eps_act averages into a single column, joined on firm id and time period
    df1 = df.groupby(groupby_arr)[value_col].apply(lambda x: ', '.join(x.astype(str))).reset_index()
    
    #separate eps_fc and eps_act averages then rename columns
    df1 = pd.concat([df1[groupby_arr], df1[value_col].str.split(', ', expand = True)], axis = 1)
    df1.rename(columns = {0: 'eps_fc', 1: 'eps_act'}, inplace = True)
    
    #convert data types
    df1 = df1.astype({'eps_fc' : 'float64', 'eps_act' : 'float64'})
    
    #add intercept
    df1['intercept'] = 1
    
    return df1
In [135]:
avgs_twenty_ols = separate_eps_fc_act(avgs_twenty, ['firm_id', 'average_type'], 'average')
avgs_twenty_ols.sample(10)
Out[135]:
firm_id average_type eps_fc eps_act intercept
270 270 twenty_year 0.695905 0.495466 1
440 440 twenty_year 1.058321 0.722410 1
1 1 twenty_year 0.339879 -0.388520 1
343 343 twenty_year 0.523108 0.090036 1
186 186 twenty_year 0.814634 0.644469 1
173 173 twenty_year 0.358250 0.329103 1
386 386 twenty_year 0.913381 0.781269 1
288 288 twenty_year 1.692273 0.998828 1
401 401 twenty_year 1.000405 1.030357 1
432 432 twenty_year 0.598333 0.557525 1
In [136]:
ax = sb.lmplot(data= avgs_twenty_ols, x = 'eps_fc', y = 'eps_act', height = 10, 
          scatter_kws = {'color' : 'black'}, line_kws = {'color' : 'orchid'})

plt.title('Actual vs. Forecasted EPS', size = 25)
plt.suptitle('as Twenty-Year Averages', size = 17)
plt.xlabel('Forecasted EPS', size = 15)
plt.ylabel('Actual EPS', size = 15)

plt.savefig(PATH_BIVARIATE + 'avgs-act-fc-twenty.png')

Observation 1: The relationship between actual and forecasted EPS shows a wide variance in y-values from x = 0.0 to x = 9.0.

Observation 2: Most of the points are clustered between 0.0 and 2.5 on the x-axis. Besides that cluster, all other points are outliers, which contribute not only to the scatterplot's visible variance, but also lead the trend towards a general positive linear trend (after ignoring the outlier at x = 5.0, y = 0.25, of course)

Isolate Quarterly EPS data by firm

In [137]:
avgs_quarter_ols = separate_eps_fc_act(avgs_quarter, ['firm_id', 'average_type', 'time_period'], 'average')
avgs_quarter_ols.sample(10)
Out[137]:
firm_id average_type time_period eps_fc eps_act intercept
1212 303 quarterly q1 0.063429 0.320114 1
272 68 quarterly q1 0.598762 0.557332 1
1571 392 quarterly q4 0.358524 0.238000 1
916 229 quarterly q1 0.248684 0.065357 1
18 4 quarterly q3 1.361571 0.968750 1
1671 417 quarterly q4 0.509800 0.240476 1
1357 339 quarterly q2 0.436667 -0.170952 1
1914 478 quarterly q3 0.922571 0.905238 1
1431 357 quarterly q4 0.401095 0.397375 1
1035 258 quarterly q4 0.880286 0.817866 1
In [138]:
ax = sb.lmplot(data= avgs_quarter_ols, x = 'eps_fc', y = 'eps_act', height = 10, 
          scatter_kws = {'color' : 'black'}, line_kws = {'color' : 'orchid'})

plt.title('Actual vs. Forecasted EPS', size = 20)
plt.suptitle('as Quarterly Averages', size = 17)
plt.xlabel('Forecasted EPS', size = 15)
plt.ylabel('Actual EPS', size = 15)

plt.savefig(PATH_BIVARIATE + 'avgs-act-fc-quarter.png')

Observation 1: Unlike the previous graph showing the twenty-year averages of forecasted and actual EPS, the 95% confidence interval for quarterly averages is much more narrow.

Observation 2: There is a similar pattern of all points being "clustered" at the beginning, with outliers being abundant outside of the initial x-range. Here, that range is from x = -1 to x = 3.5.

Observation 3: Outliers among EPS quarterly averages are much more varied and pronounced, which causes the regression line to be much less steep than the regression line for twenty-year EPS averages.

Isolate Yearly EPS data by firm

In [139]:
avgs_year_ols = separate_eps_fc_act(avgs_year, ['firm_id', 'average_type', 'time_period'], 'average')
avgs_year_ols.sample(10)
Out[139]:
firm_id average_type time_period eps_fc eps_act intercept
3410 162 yearly 2007 0.585500 0.452500 1
189 9 yearly 1999 0.085750 0.122812 1
91 4 yearly 2006 NaN NaN 1
4823 229 yearly 2013 0.363750 -1.082500 1
4336 206 yearly 2009 0.296750 0.322500 1
2581 122 yearly 2018 0.866250 0.965000 1
2948 140 yearly 2007 0.447333 0.582500 1
9518 453 yearly 2004 NaN NaN 1
8130 387 yearly 2002 NaN -0.587500 1
2196 104 yearly 2011 0.197500 0.188750 1
In [140]:
ax = sb.lmplot(data= avgs_year_ols, x = 'eps_fc', y = 'eps_act', height = 10, 
          scatter_kws = {'color' : 'black'}, line_kws = {'color' : 'orchid'})

plt.title('Actual vs. Forecasted EPS', size = 20)
plt.suptitle('as Yearly Averages', size = 17)
plt.xlabel('Forecasted EPS', size = 15)
plt.ylabel('Actual EPS', size = 15)

plt.savefig(PATH_BIVARIATE + 'avgs-act-fc-year.png')

Observation 1: There is significantly less variance among yearly EPS averages, with only 2 outliers at x = -4 and x = 0.5.

Observation 2: Compared to twenty-year and quarterly EPS averages, the 95% confidence interval for yearly EPS averages is much more narrow than both of them.

Observation 3: Residuals tend to stay closer to the regression line, implying that there is lower bias in the relationship between forecasted and actual yearly EPS averages.

Isolate Raw EPS data by firm

In [141]:
features_eps_fc_act = features.loc[features['feature'].isin(['eps_act', 'eps_fc'])]
features_all_ols = separate_eps_fc_act(features_eps_fc_act, ['firm_id', 'term'], 'value')
features_all_ols.sample(10)
Out[141]:
firm_id term eps_fc eps_act intercept
29645 352 2018Q2 47.207 55.900000 1
41368 492 2009Q1 -0.326 -0.511581 1
11772 140 2002Q1 0.115 0.210000 1
24142 287 2007Q3 0.703 1.620000 1
35980 428 2006Q1 NaN NaN 1
37841 450 2009Q2 0.221 0.200000 1
15068 179 2007Q1 NaN NaN 1
18869 224 2012Q2 2.225 2.400000 1
8568 102 1999Q1 0.402 0.420000 1
9184 109 2006Q1 0.052 0.050000 1
In [142]:
ax = sb.lmplot(data= features_all_ols, x = 'eps_fc', y = 'eps_act', height = 10, 
          scatter_kws = {'color' : 'black'}, line_kws = {'color' : 'orchid'})

plt.suptitle('Actual vs. Forecasted EPS', size = 20)
plt.xlabel('Forecasted EPS', size = 15)
plt.ylabel('Actual EPS', size = 15)

plt.savefig(PATH_BIVARIATE + 'features-act-fc-all.png')

Observation 1: The raw actual vs. forecasted EPS help paint a better overall picture of the trends.

Observation 2: The previous graphs depicting averages instead of raw data have clustered points at the beginning of the x-axis. But here, the scatterplot starts with 4 outliers until the first "cluster" of points takes place.

Observation 3: From x = 0 onward, the initial cluster of points gradually declusters afterwards while maintaining minimal distance from the regression line as residuals. Out of all the regression plots depicting actual vs. forecasted EPS, the raw data displays the least amount of variance. This potentially could mean the lowest amount of bias as well.

Observation 4: The slope for raw EPS data is less steep than all of the three previous graphs, and this relationship depicts a slight positive linear relationship: the weakest linear trend out of them all.

Observation 1: Interestingly enough, the graph encompassing the broadest data (all data from features) has the most stable 95% confidence interval as per the regression line.

Observation 2: The quarterly average regression line is the most unstable.

"Naive Predictions" vs "Expert Predictions"

1) Create a DF depicting eps_act last quarter vs eps_act this quarter.

2) Create "naive predictions"

3) Compare my predictions to all eps_fc values

In [143]:
#isolate eps_act entries, drop date column
features_naive_eps = features[features['feature'] == 'eps_act'].loc[:, features.columns != 'date']
In [144]:
#grab previous eps_act term
features_naive_eps['previous_value'] = features_naive_eps['value'].shift(1)

features_naive_eps = features_naive_eps[['firm_id', 'feature','term', 'value', 'previous_value']]
In [145]:
features_naive_eps.head()
Out[145]:
firm_id feature term value previous_value
125240 0 eps_act 1999Q1 0.16 NaN
125241 0 eps_act 1999Q2 0.35 0.16
125242 0 eps_act 1999Q3 0.30 0.35
125243 0 eps_act 1999Q4 0.32 0.30
125244 0 eps_act 2000Q1 0.29 0.32

"Naive Prediction" = the moving average of the values from the last 2 quarters.

In [146]:
#create naive prediction
features_naive_eps['naive_prediction'] = features_naive_eps.value.rolling(2).mean()
In [147]:
#preview
features_naive_eps.head(5)
Out[147]:
firm_id feature term value previous_value naive_prediction
125240 0 eps_act 1999Q1 0.16 NaN NaN
125241 0 eps_act 1999Q2 0.35 0.16 0.255
125242 0 eps_act 1999Q3 0.30 0.35 0.325
125243 0 eps_act 1999Q4 0.32 0.30 0.310
125244 0 eps_act 2000Q1 0.29 0.32 0.305
In [148]:
#create column showing corresponding eps_fc for that term

#create Series of eps_fc values
features_fc = features[features.feature == 'eps_fc'].drop(columns = ['date', 'feature'])
In [149]:
features_fc.head()
Out[149]:
firm_id term value
82820 0 1999Q1 NaN
82821 0 1999Q2 NaN
82822 0 1999Q3 NaN
82823 0 1999Q4 NaN
82824 0 2000Q1 NaN
In [150]:
#merge eps_fc values with naive predictions on 'firm_id' and 'term'
features_naive_eps = features_naive_eps.merge(features_fc, on = ['term', 'firm_id'], how = 'left')
features_naive_eps.rename(columns = {'value_x' : 'value', 'value_y' : 'eps_fc_value'}, inplace = True)
In [151]:
#create 3-year moving averages for all numerical fields
features_naive_eps['ma_value'] = features_naive_eps.value.rolling(12).mean()
features_naive_eps['ma_previous_value'] = features_naive_eps.previous_value.rolling(12).mean()
features_naive_eps['ma_naive_prediction'] = features_naive_eps.naive_prediction.rolling(12).mean()
features_naive_eps['ma_eps_fc_value'] = features_naive_eps.eps_fc_value.rolling(12).mean()
In [152]:
features_naive_eps.sample(5)
Out[152]:
firm_id feature term value previous_value naive_prediction eps_fc_value ma_value ma_previous_value ma_naive_prediction ma_eps_fc_value
35152 418 eps_act 2009Q1 0.160000 0.240 0.200000 0.413 0.539167 0.555000 0.547083 0.540250
16888 201 eps_act 2000Q1 0.460000 0.495 0.477500 0.438 NaN NaN NaN 0.679250
39932 475 eps_act 2007Q1 0.430000 0.390 0.410000 0.370 0.345833 0.333333 0.339583 NaN
12731 151 eps_act 2010Q4 0.965334 1.530 1.247667 0.697 0.725445 NaN NaN 0.925333
18131 215 eps_act 2016Q4 1.020000 3.070 2.045000 2.367 2.760833 2.862500 2.811667 2.914750
In [153]:
features_naive_eps.head()
Out[153]:
firm_id feature term value previous_value naive_prediction eps_fc_value ma_value ma_previous_value ma_naive_prediction ma_eps_fc_value
0 0 eps_act 1999Q1 0.16 NaN NaN NaN NaN NaN NaN NaN
1 0 eps_act 1999Q2 0.35 0.16 0.255 NaN NaN NaN NaN NaN
2 0 eps_act 1999Q3 0.30 0.35 0.325 NaN NaN NaN NaN NaN
3 0 eps_act 1999Q4 0.32 0.30 0.310 NaN NaN NaN NaN NaN
4 0 eps_act 2000Q1 0.29 0.32 0.305 NaN NaN NaN NaN NaN
In [154]:
fig = plt.figure(figsize = [10, 10])

sb.scatterplot(data = features_naive_eps, x = 'naive_prediction', y = 'value', color = 'green')
sb.scatterplot(data = features_naive_eps, x = 'eps_fc_value', y = 'value', color = 'purple')

fig.legend(labels = ['My Forecasted Eps', 'Forecasted EPS'], prop = {'size' : 15}, loc = 'lower right')

plt.suptitle('Personal EPS Forecasts vs. Actual EPS', size = 20, y = .94)
# plt.title('in 3-year Moving Averages', size = 17)
plt.ylabel('EPS Value')
plt.xlabel('Actual EPS')

plt.savefig(PATH_MULTIVARIATE + 'features-naive-act-fc-scatter.png')
plt.show();

Observation 1: With actual EPS as the x-value, we can see that my personal forecasts show much more variance with outliers spreading across the entire scatterplot, in all 4 quartiles. If we ignored the outliers, then all of my personal forecasts cluster around (0, 0).

Observation 2: Similarly, experts' forecasted EPS has only 3 outliers and generally stays uniformly clustered around (0,0). This means that my approach of generating personal forecasts by 2-quarter averages is much less accurate than the method that Bloomberg forecasters had used.

In [155]:
fig = plt.figure(figsize = [10, 10])

sb.regplot(data = features_naive_eps, x = 'naive_prediction', y = 'value', color = 'green')
sb.regplot(data = features_naive_eps, x = 'eps_fc_value', y = 'value', color = 'purple')

fig.legend(labels = ['My Forecasted EPS', 'Forecasted EPS'], prop = {'size' : 15}, loc = 'lower right')

plt.suptitle('Forecasted EPS vs. Actual EPS', size = 20, y = .94)
# plt.title('in 3-year Moving Averages', size = 17)
plt.ylabel('EPS')
plt.xlabel('Actual EPS')

plt.savefig(PATH_MULTIVARIATE + 'features-naive-act-fc-reg.png')
plt.show();

Observation 1: Not surprisingly, my personal forecasted EPS has a much wider 95% confidence interval than experts' forecasted EPS. Due to the wider variance for the former, those 2-quarter rolling averages would subsequently come equipped with a broader confidence interval scope.

Observation 2: Experts' forecasted EPS, while counting outliers, exhibits a greater rate of increase compared to my personal forecasts after x = 0. This means that when compared to actual EPS values, experts' forecasts tend to surpass my personal forecasts.

Observation 3: When ignoring outliers, experts' forecasted EPS and my personal forecasts clusters uniformly around (0, 0), with slightly more of my personal forecasts taking up the left side of the "cluster:" before x = 0. This confirms my previous statement that although both variables show a general consistent positive linear trend, my personal forecasts are less reliable and more inaccurate due to having more variance, outliers insonsistent with the pairing regression line trend, and moving-averages depend solely on historical data.

Observation 4: Most of my personally generated outliers reside in Quartile I and the x-axis side of Quartile IV. All experts' forecasted EPS outliers reside in Quartiles II and III. Both forecasts' outliers are scattered in completely different quartiles from each other. This observation helps to further accentuate that although both regression lines share a linear positive trend, the distribution of outliers shows that both variables are not necessarily correlated to each other in terms of actual EPS.

In [156]:
fig = plt.figure(figsize = [10, 10])

sb.scatterplot(data = features_naive_eps, x = 'naive_prediction', y = 'value', color = 'purple')

plt.suptitle('Actual EPS vs. My Predicted EPS', size = 20, y = .94)
plt.ylabel('Actual EPS')
plt.xlabel('My Predicted EPS')

plt.savefig(PATH_MULTIVARIATE + 'features-naive-act-scatter.png')
plt.show();
In [157]:
fig = plt.figure(figsize = [10, 10])

sb.regplot(data = features_naive_eps, x = 'naive_prediction', y = 'value', color = 'purple')

plt.suptitle('Actual EPS vs. My Predicted EPS', size = 20, y = .94)
plt.ylabel('Actual EPS')
plt.xlabel('My Predicted EPS')

plt.savefig(PATH_MULTIVARIATE + 'features-naive-act-reg.png')
plt.show();

V) Multivariate Exploration

"Naive Predictions" vs. "Expert Predictions"

In [158]:
value_vars = ['value', 'previous_value', 'naive_prediction', 'eps_fc_value', 'ma_value', 'ma_previous_value',
             'ma_naive_prediction', 'ma_eps_fc_value']
In [159]:
#melt naive_predictions and eps_fc_value to prepare for plotting
df_naive_eps_melt = pd.melt(features_naive_eps, id_vars = ['firm_id', 'feature', 'term'],
       value_vars = value_vars,
       var_name = 'value_type')
In [160]:
#isolate only naive predictions and eps_fc_value
naive_eps_fc = df_naive_eps_melt[df_naive_eps_melt.value_type.isin(['naive_prediction', 'eps_fc_value', 'value'])]
In [161]:
naive_eps_fc['ma_value'] = naive_eps_fc.value.rolling(12).mean()
D:\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
In [162]:
naive_eps_fc.sample(5)
Out[162]:
firm_id feature term value_type value ma_value
2820 33 eps_act 2011Q1 value 0.710 -0.236667
144359 203 eps_act 2010Q4 eps_fc_value 0.474 0.359083
21403 254 eps_act 2015Q4 value 0.880 0.669077
130427 37 eps_act 2013Q4 eps_fc_value 1.805 1.446583
133688 76 eps_act 2010Q1 eps_fc_value 0.080 0.116000
In [163]:
plt.figure(figsize = [20, 10])
ax = sb.pointplot(x = 'term', y = 'value', data = naive_eps_fc, hue = 'value_type',
                  errwidth = .5, palette = 'Set2')

plt.xticks(rotation = 'vertical')

leg_handles = ax.get_legend_handles_labels()[0]
ax.legend(leg_handles, ['Actual EPS', 'My Forecasted EPS', 'Forecasted EPS'], prop = {'size' : 20})

plt.title('Average Forecasted EPS and My Predicted EPS by Term', size = 20)
plt.ylabel('Average EPS Value')
plt.xlabel('Term')

plt.savefig(PATH_MULTIVARIATE + 'features-naive-eps.png')
plt.show();
In [164]:
#isolate actual EPS and personal predicted EPS
ma_naive_fc = df_naive_eps_melt[df_naive_eps_melt.value_type.isin(['ma_naive_prediction', 'ma_value', 'ma_eps_fc_value'])]
In [165]:
ma_naive_fc.value_type.unique()
Out[165]:
array(['ma_value', 'ma_naive_prediction', 'ma_eps_fc_value'], dtype=object)
In [166]:
plt.figure(figsize = [20, 10])
ax = sb.pointplot(x = 'term', y = 'value', data = ma_naive_fc, hue = 'value_type',
                  errwidth = .5, palette = "Set2")

plt.xticks(rotation = 'vertical')

leg_handles = ax.get_legend_handles_labels()[0]
ax.legend(leg_handles, ['Actual EPS', 'My Forecasted EPS', 'Forecasted EPS'], prop = {'size' : 20})

plt.suptitle('Average Forecasted EPS and My Predicted EPS by Term', size = 20, y = .95)
plt.title('in 3-Year Moving Averages from 1999 - 2019', size = 17)
plt.ylabel('Moving Average EPS Value')
plt.xlabel('Term')

plt.savefig(PATH_MULTIVARIATE + 'features-naive-eps-ma.png')
plt.show();

Observation 1: In the first figure, my predicted EPS has strayed drastically away from the more normal actual EPS general trend pattern in 2 instances. Between 2000Q4 and 2001Q1, my forecasts become optimistic, forming a 2-point peak before jumping back into the uniform trend. Between 2008Q4 and 2009Q1, my forecasts become pessimistic, forming a 2-point trough before hopping back up into the uniform trend similar to the actual EPS trend.

Observation 2: When plotting moving averages per term, both trend lines follow an almost-identical polynomial trend, overall leading into a direction of a positive slope.

The technique that the Bloomberg forecasters used was much more effective than my method of predicting by using 2-quarter moving averages.

Time vs. Percentage Error vs. Firm IDs

In [167]:
#add percentage_error column to features
features_pct = features[features.feature.isin(['eps_fc', 'eps_act'])]
In [168]:
features_pct = separate_eps_fc_act(features_pct, ['firm_id', 'term'], 'value')
In [169]:
features_pct['pct_error'] = (features_pct.eps_act - features_pct.eps_fc) / features_pct.eps_act
In [170]:
features_pct['pct_error'] = features_pct['pct_error'].replace([np.inf, -np.inf], np.nan)
In [171]:
#get absolute value of percentage errors
features_pct['pct_error_abs'] = features_pct.pct_error.abs()
In [172]:
#add ticker column for plotting assistance
features_pct['firm_ticker'] = convert_ids_to_tickers(features_pct.firm_id)
In [173]:
#add 3-year moving averages for percentage error
features_pct['ma_pct_error'] = features_pct.pct_error.rolling(12).mean()
In [174]:
#grab top/bottom 10 firms
features_pct_top_ids = features_pct.sort_values(by = 'pct_error_abs', ascending = False).firm_id.drop_duplicates().head(5).values
features_pct_bottom_ids = features_pct.sort_values(by= 'pct_error_abs', ascending = True).firm_id.drop_duplicates().head(5).values
In [175]:
#filter DF for those 10 firms
features_pct_top = features_pct[features_pct.firm_id.isin(features_pct_top_ids)]
features_pct_bottom = features_pct[features_pct.firm_id.isin(features_pct_bottom_ids)]

Percentage Error by Year

In [176]:
plt.figure(figsize = [20, 15])
ax = sb.pointplot(data = features_pct_top, x = features_pct_top.term.dt.year, y = 'pct_error',
                  hue = 'firm_ticker', dodge = True, legend_out = False, errwidth = 0.5)


plt.xlabel('Year')
plt.ylabel('Average Percentage Error')
plt.title('Average Percentage Error by Year (for the Top 5 Most Inaccurate Firms)', size = 20)

plt.savefig(PATH_MULTIVARIATE + 'pct-firm-year-top.png')
plt.show()

The above visual isn't great for analysis. Let's look at a strip plot distribution instead.

In [177]:
plt.figure(figsize = [20, 10])
sb.stripplot(data = features_pct_top, x = features_pct_top.term.dt.year, y = 'pct_error', hue = 'firm_ticker')
plt.xticks(rotation = 'vertical')

plt.suptitle('Distribution of EPS Percentage Error by Year', size = 20, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 17)

plt.xlabel('Year')
plt.ylabel('Percentage Error')

plt.savefig(PATH_MULTIVARIATE + 'pct-firm-year-top-strip.png')

Observation 1: The most notable outlier is IRM, with a percentage error of over -1000 for EPS in the year 2014.

Observation 2: BIM, IRM, MCK, PXD, and QRVO are the five most inaccurately predicted firms in terms of percentage error by year.

Observation 3: All other firms beside IRM share large distance gaps from 2014 - 2018 with IRM in 2014.

Observation 4: Outliers only started showing up in distributions starting on the year 2014. Therefore, this means that EPS forecasts have become more inaccurate in the more recent years, starting from 2014.

Percentage Error by Quarter

In [178]:
plt.figure(figsize = [10, 10])
ax = sb.stripplot(data = features_pct_top, x = features_pct_top.term.dt.quarter, y = 'pct_error',
                  hue = 'firm_ticker', jitter = True)

plt.suptitle('Distribution of EPS Percentage Error by Quarter', size = 15, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 13)
plt.ylabel('Average Percentage Error', size = 12)
plt.xlabel('Quarter', size = 12)
plt.legend(title = 'Firm Ticker')

plt.savefig(PATH_MULTIVARIATE + 'pct-firm-quarter-top-strip.png')

Observation 1: The most notable outlier is IRM yet again, with a percentage error of over -1000 for EPS in Quarter 3.

Observation 2: IBM, IRM, MCK, PXD, and QRVO are the five most inaccurately predicted firms in terms of percentage error by quarter. This list is the same as the previous stripplot and pointplot depicting years.

Observation 3: Quarter 4 is the only quarter with no outliers. This makes intuitive sense because as the quarters in any singular year pass and familiarity with the current fiscal year builds, the more accurate forecasted EPS would be. Let us test this theory.

In [179]:
plt.figure(figsize = [20, 15])
ax = sb.pointplot(data = features_pct_top, x = features_pct_top.term.dt.quarter, y = 'pct_error', hue = 'firm_ticker', dodge = True,
                  legend_out = False, errwidth = 0.5)

plt.xlabel('Quarter', size = 17)
plt.ylabel('Average Percentage Error', size = 17)
plt.title('Average Percentage Error by Quarter (for the Top 5 Most Inaccurate Firms)', size = 20)
plt.xticks(size = 15)
plt.legend(title = 'Firm Ticker')

plt.savefig(PATH_MULTIVARIATE + 'pct-firm-quarter.png')
plt.show()

Observation 1: The firms IRM and IBM deviate greatly in Quarter 3, where forecasters' predictions veer away from zero. Meanwhile, the other firms cluster more closely around 0. The most inaccurate predictions, on average, happen to be made during Quarter 3 in any given year.

Observation 2: The previous observation disproves my initial intuition that forecasters' predictions would become more accurate by the quarter.

Percentage Error by Year and Quarter

In [180]:
plt.figure(figsize = [22, 15])
ax = sb.pointplot(data = features_pct_top, x = 'term', y = 'pct_error', hue = 'firm_ticker', dodge = True,
                  legend_out = False)

plt.xlabel('Term')
plt.ylabel('Percentage Error')
plt.title('Percentage Error by Year and Quarter (for the Top 5 Most Inaccurate Firms)', size = 20)
plt.xticks(rotation = 'vertical')
plt.legend(title = 'Firm Ticker')

plt.savefig(PATH_MULTIVARIATE + 'pct-firm-term.png')
plt.show()

This isn't a very good visualization either. Let's look at a stripplot instead:

In [181]:
plt.figure(figsize = [20, 10])
sb.stripplot(data = features_pct_top, x = 'term', y = 'pct_error', hue = 'firm_ticker')

plt.suptitle('Distribution of EPS Percentage Error by Term', size = 20, y = .94)
plt.title('for the top 5 Most Inaccurate Firms', size = 17)

plt.xlabel('Term', size = 17)
plt.ylabel('Percentage Error', size = 17)
plt.xticks(rotation = 'vertical')
plt.legend(title = 'Firm Ticker')

plt.savefig(PATH_MULTIVARIATE + 'pct-ferm-tirm-strip.png')
plt.show();

Observation 1: The most notable outlier is IRM, with a percentage error of over -1000 for EPS in the term 2014Q3.

Observation 2: BIM, IRM, MCK, PXD, and QRVO are the five most inaccurately predicted firms in terms of percentage error by term. This list is consistent for the previous plots feature standalone quarters and years.

Observation 3: outliers start after 2014Q3. In the next terms, outliers tend to year in Q1 and Q3 exclusively; much more Q3.

Observation 4: All other firms beside IRM share large distance gaps from 2014Q3 - 2018Q3 from IRM in 2014Q3.

Observation 5: Outliers only started showing up in distributions starting in the year 2014. Therefore, this means that EPS forecasts have become more inaccurate in the more recent terms, starting from 2014Q3. This conclusion is consistent with the observation made in the quarterly and yearly stripplots & pointplots.

Prediction Error by Term by Top 5 Firms

In [182]:
features_pct['difference'] = features_pct.eps_act - features_pct.eps_fc
In [183]:
features_pct['difference_abs'] = features_pct.difference.abs()
In [184]:
diffs_top_ids = features_pct.sort_values(by = 'difference_abs', ascending = False).firm_id.drop_duplicates().head(5).values
diffs_bottom_ids = features_pct.sort_values(by = 'difference_abs', ascending = True).firm_id.drop_duplicates().head(5).values
In [185]:
diffs_top = features_pct[features_pct.firm_id.isin(diffs_top_ids)]
diffs_bottom = features_pct[features_pct.firm_id.isin(diffs_bottom_ids)]

Prediction Error by Term

In [186]:
plt.figure(figsize = [20, 10])
sb.pointplot(data = diffs_top, x = diffs_top.term, y = 'difference', hue = 'firm_ticker', errwidth = 0.5)

plt.legend(title = 'Firm Ticker')
plt.xticks(rotation = 'vertical')
plt.xlabel('Term')
plt.ylabel('Average Prediction Error')
plt.suptitle('Average EPS Prediction Error by Term', size = 20, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 17)

plt.savefig(PATH_MULTIVARIATE + 'features-perror-term-firm.png')
plt.show();

Observation 1: The 2 most noticeable outliers are the firm AIG in 2008Q4 (optimistic forecasts), and LRCX in 2000Q3 (pessimistic forecasts).

Observation 2: The 5 most inaccurate firms by prediction error are AGN, AIG, CHTR, LRCX, and VRSN.

Observation 3: VRSN is the most stable of the top 5 most inaccurate firms; its average prediction error by term hugs closely to a slopeless y = 0 trend.

Prediction Error by Year

In [187]:
plt.figure(figsize = [20, 10])
sb.pointplot(data = diffs_top, x = diffs_top.term.dt.year, y = 'difference', hue = 'firm_ticker',errwidth = 0.5)

plt.legend(title = 'Firm Ticker')
plt.xticks(rotation = 'vertical')
plt.xlabel('Year')
plt.ylabel('Prediction Error')
plt.suptitle('Average EPS Prediction Error by Year', size = 20, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 17)

plt.savefig(PATH_MULTIVARIATE + 'features-perror-year-firm.png')
plt.show();

Observation 1: Yet again, the 2 most noticeable outliers are the firm AIG in 2008 (optimistic forecasts), and LRCX in 2000 (pessimistic forecasts).

Observation 2: The 5 most inaccurate firms by prediction error are AGN, AIG, CHTR, LRCX, and VRSN. The same list as before.

Observation 3: VRSN is the most stable of the top 5 most inaccurate firms; it contains no outliers. Data for the firm CHTR shows up only in 2017.

Prediction Error by Quarter

In [188]:
plt.figure(figsize = [20, 10])
sb.pointplot(data = diffs_top, x = diffs_top.term.dt.quarter, y = 'difference', hue = 'firm_ticker',errwidth = 0.5)

plt.legend(title = 'Firm Ticker')
plt.xlabel('Quarter')
plt.ylabel('Prediction Error')
plt.suptitle('Average EPS Prediction Error by Quarter', size = 20, y = .94)
plt.title('for the Top 5 Most Inaccurate Firms', size = 17)
plt.xticks(size = 15)

plt.savefig(PATH_MULTIVARIATE + 'features-perror-quarter-firm.png')
plt.show();

Observation 1: One most noticable outlier is AIG, which is, on average, the only outlier in Q3 of any given year.

Observation 2: The 5 most inaccurate firms by prediction error are AGN, AIG, CHTR, LRCX, and VRSN. The same list as generated through a yearly and term basis.

Observation 3: These 5 firms display relatively low prediction errors until Q4, where both LRCX and AIG "branch off" into opposite directions. This means that forecasters, on average, are likely to be more inaccurate in their EPS forecasts in Q4 of any given year.

Observation 4: VRSN is the most stable of the top 5 most inaccurate firms; it contains no outliers. Data for the firm CHTR shows up only in 2017.

Cross-Validation Testing

find root mean square error for both "naive" forecasts and Bloomberg forecasts

In [189]:
#isolate both forecast EPS types and actual EPS
eps_fc_all = features_naive_eps[['firm_id', 'feature', 'term', 'value', 'naive_prediction', 'eps_fc_value']]
eps_fc_all = eps_fc_all.rename(columns = {'value' : 'eps_act', 'naive_prediction' : 'eps_fc_db', 'eps_fc_value' : 'eps_fc'})
In [190]:
eps_fc_all.sample(5)
Out[190]:
firm_id feature term eps_act eps_fc_db eps_fc
27306 325 eps_act 2000Q3 0.19000 0.390 NaN
9658 114 eps_act 2019Q3 2.76000 2.080 0.709
30489 362 eps_act 2019Q2 0.66000 0.670 0.633
22743 270 eps_act 2014Q4 0.60479 NaN 0.797
32153 382 eps_act 2015Q2 -0.15000 -0.155 1.957

calculate root mean square errors (RMSE) per firm

  1. calculate all predicted errors
  2. square all predicted errors
  3. calculate mean squared predicted errors (per firm)
  4. take square root of resulting values
$$ RMSE = \sqrt{\frac{\sum ((predicted - actual)^2}{N}} $$
In [191]:
#helper function
def calculate_rmse(df, predicted_col):
    rmse = np.sqrt(mean_squared_error(df['eps_act'], df[predicted_col]))
    return pd.Series(np.array(rmse))
In [192]:
temp = eps_fc_all.dropna()
In [193]:
#number of entries dropped
eps_fc_all.shape[0] - temp.shape[0]
Out[193]:
8337

We cannot include entrees with NaN values in our RMSE calculations. We dropped 8337 entries with NaN values for actual EPS, forecasted Bloomberg EPS, and personal EPS forecasts.

Original size: 42420

New size: 34083

In [194]:
#isolate Bloomberg RMSE
firm_rmse_fc = temp.groupby(['firm_id']).apply(calculate_rmse, predicted_col = 'eps_fc').reset_index().rename(columns = { 0: 'rmse_fc'})
firm_rmse_fc.rmse_fc = firm_rmse_fc.rmse_fc.astype('float64')
In [195]:
#isolate naive RMSE
firm_rmse_fc_db = temp.groupby('firm_id').apply(calculate_rmse, predicted_col = 'eps_fc_db').reset_index().rename(columns = {0: 'rmse_fc_db'})
firm_rmse_fc_db.rmse_fc_db = firm_rmse_fc_db.rmse_fc_db.astype('float64')
In [196]:
#create 1 DataFrame to contain both RMSE types
cross_val_firm = pd.merge(firm_rmse_fc, firm_rmse_fc_db, how = 'outer', on = 'firm_id')
cross_val_firm.head()
Out[196]:
firm_id rmse_fc rmse_fc_db
0 0 0.804354 0.577949
1 1 1.325708 1.041689
2 2 0.321330 0.259615
3 3 0.127809 0.283630
4 4 0.853493 0.495081
In [197]:
#add ticker column for plotting assistance
cross_val_firm['firm_ticker'] = convert_ids_to_tickers(cross_val_firm.firm_id)
In [198]:
cross_val_firm['diff'] = cross_val_firm.rmse_fc - cross_val_firm.rmse_fc_db
In [199]:
cross_val_firm_melt = pd.melt(cross_val_firm, id_vars = ['firm_id', 'firm_ticker', 'diff'],
       value_vars = ['rmse_fc', 'rmse_fc_db'],
        var_name = 'rmse_type',
       value_name = 'rmse')
In [200]:
cross_val_firm_melt.sample(5)
Out[200]:
firm_id firm_ticker diff rmse_type rmse
483 490 WRK 0.321423 rmse_fc 1.041745
921 428 SYF 0.111926 rmse_fc_db 0.091297
574 78 BXP 0.434886 rmse_fc_db 0.805590
473 480 WCG 0.207509 rmse_fc 0.723215
819 326 MS 0.014640 rmse_fc_db 0.557130

reorder by top 20 most inaccurate firms (naive AND Bloomberg rmse)

In [201]:
firm_rmse_fc_top = cross_val_firm_melt[cross_val_firm_melt['rmse_type'] == 'rmse_fc'].sort_values(by = 'rmse', ascending = False)
In [202]:
firm_rmse_fc_db_top = cross_val_firm_melt[cross_val_firm_melt['rmse_type'] == 'rmse_fc_db'].sort_values(by = 'rmse', ascending = False)
In [203]:
firm_rmse_fc_top_tickers = firm_rmse_fc_top.head(20).firm_ticker.drop_duplicates().values
firm_rmse_fc_db_top_tickers = firm_rmse_fc_db_top.head(20).firm_ticker.drop_duplicates().values
In [204]:
set(firm_rmse_fc_top_tickers) ^ set(firm_rmse_fc_db_top_tickers)
Out[204]:
{'APA', 'AVGO', 'ETFC', 'RE'}
In [205]:
#top 10-20 most inaccurate firms by Bloomberg EPS forecasted RMSE
top_firm_rmse_fc = cross_val_firm_melt[cross_val_firm_melt.firm_ticker.isin(firm_rmse_fc_top_tickers)]
top_firm_rmse_fc = top_firm_rmse_fc[top_firm_rmse_fc['rmse_type'] == 'rmse_fc'].sort_values(by = 'rmse', ascending = False)
In [206]:
plt.figure(figsize = [15, 10])
sb.pointplot(data = top_firm_rmse_fc, x = 'firm_ticker', y = 'rmse')
plt.xticks(rotation = '90')

plt.suptitle('Top 10 Most Inaccurate Firms by Bloomberg EPS RMSE', y = .94, size = 20)
plt.title('in decreasing order', size = 17)
plt.ylabel('Bloomberg EPS RMSE')
plt.xlabel('Firm Ticker')
plt.savefig(PATH_BIVARIATE + 'rmse-bloomberg.png')
plt.show();

Observation 1: The 2 most notable outlier firm tickers are LCRX and AIG, with values at approximately 68 and 65 respectively.

Observation 2: After AIG and LCRX, RMSE values for the next most inaccurate firms (in descending order) follow a more stable pattern, with a decreasing negative linear trend approaching 0.

In [207]:
#top 10-20 most inaccurate firms by naive EPS forecasted RMSE
top_firm_rmse_fc_db = cross_val_firm_melt[cross_val_firm_melt.firm_ticker.isin(firm_rmse_fc_db_top_tickers)]
top_firm_rmse_fc_db = top_firm_rmse_fc_db[top_firm_rmse_fc_db['rmse_type'] == 'rmse_fc_db'].sort_values(by = 'rmse', ascending = False)
In [208]:
#top 10-20 most inaccurate firms by Bloomberg EPS forecast RMSE
plt.figure(figsize = [15, 10])
sb.pointplot(data = top_firm_rmse_fc_db, x = 'firm_ticker', y = 'rmse')
plt.xticks(rotation = '90')

plt.suptitle('Top 10 Most Inaccurate Firms by Naive EPS RMSE', y = .94, size = 20)
plt.title('in decreasing order', size = 17)
plt.ylabel('Naive EPS RMSE')
plt.xlabel('Firm Ticker')
plt.savefig(PATH_BIVARIATE + 'rmse-naive.png')
plt.show();

Observation 1: The 2 most notable outlier firm tickers are LCRX and AIG, with values at approximately 48 and 36 respectively. These firm tickers for Naive EPS RMSE are consistent with the outlying firm tickers for the Bloomberg EPS RMSE.

Observation 2: After AIG and LCRX, RMSE values for the next most inaccurate firms (in descending order) follow a more stable pattern, with a decreasing negative linear trend approaching 0.

Observation 3: Out of 10 isolated firms, 4 firms are not shared in common between the top 10 firm tickers by decreasing RMSE value. These tickers are 'APA', 'AVGO', 'ETFC', and 'RE'.

Random 20 forecasts

In [209]:
random_tickers = cross_val_firm_melt.firm_ticker.drop_duplicates().sample(20).values
In [210]:
rand_cross_val_firm = cross_val_firm_melt[cross_val_firm_melt.firm_ticker.isin(random_tickers)]
In [211]:
cross_val_firm_lim_melt = cross_val_firm_melt[cross_val_firm_melt.firm_ticker.isin(random_tickers)]
In [212]:
cross_val_firm_lim = cross_val_firm[cross_val_firm.firm_ticker.isin(random_tickers)]
In [213]:
cross_val_firm_lim.head()
Out[213]:
firm_id rmse_fc rmse_fc_db firm_ticker diff
16 16 0.594831 0.510281 AEP 0.084550
22 22 0.980212 0.787811 AIZ 0.192401
26 26 0.228183 0.392216 ALGN -0.164033
41 42 0.779894 0.602122 ANTM 0.177772
65 66 0.088043 0.072431 BF/B 0.015613
In [ ]:
import plotly.express as px
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(
    x = cross_val_firm_lim.rmse_fc, y = cross_val_firm_lim.firm_ticker,
    name = 'Bloomberg EPS RMSE', mode = 'markers'))

fig.add_trace(go.Scatter(
    x = cross_val_firm_lim.rmse_fc_db, y = cross_val_firm_lim.firm_ticker,
    name = 'Naive EPS RMSE', mode = 'markers'))


fig.update_layout(legend_title_text = 'RMSE Type',
                 xaxis_title = 'RMSE', yaxis_title = 'Firm Ticker', title = {
                     'text' : 'RMSE Type by 20 Random Firms',
                     'xanchor' : 'center',
                     'yanchor' : 'top',
                     'y' : 0.9,
                     'x' : 0.5
                 })

fig.write_image(PATH_MULTIVARIATE + 'rmse-blmbrg-naive.png')
fig.show();

Observation 1: Out of the 20 randomly selected firm tickers, WM, VLO, TDG, MNST, LEG, HON, GD, EXR, DISCK, DAL, CMI, CELG, CDNS, ARNC, AMAT, ADSK, and A have a Bloomberg RMSE that is greater than my Naive RMSE. Those are 17 firm tickers in total, where Bloomberg forecasts were a worse fit than my Naive forecasts.

In [ ]:
plt.figure(figsize = [7, 7])
fig, ax = plt.subplots()
ax.scatter(cross_val_firm.rmse_fc_db, cross_val_firm.rmse_fc)
x = np.linspace(*ax.get_xlim())
ax.plot(x, x, color = 'r')
ax.figure.set_size_inches(10, 10)

plt.title('Bloomberg EPS RMSE vs. Naive EPS RMSE', size = 17)
plt.xlabel('Naive EPS RMSE', size = 13)
plt.ylabel('Bloomberg EPS RMSE', size = 13)
plt.savefig(PATH_BIVARIATE + 'rmse-blmbrg-naive-outliers.png')
plt.show();

All points above the red y = x line represent Bloomberg EPS RMSE that is greater than my Naive EPS RMSE.

Observation 1: For the 2 outlying points (which represent the firms LCRX and AIG), their Bloomberg RMSE is greater than my Naive RMSE. This means that my naive EPS forecasts were a better fit for these 2 firm tickers than the Bloomberg forecasts.

Observation 2: When ignoring outliers, it's still hard to draw a conclusion based on the the ratio of points above the y = x vs below. But by visualization alone, it seems that more points are above the y = x line than below.

Let's ignore out outliers and zero in on the bulk of the data.

In [ ]:
#reexamine without outliers
cross_val_firm_out = cross_val_firm.sort_values(by = 'rmse_fc_db', ascending = False).iloc[2:]
In [ ]:
plt.figure(figsize = [7, 7])
fig, ax = plt.subplots()
ax.scatter(x = cross_val_firm_out.rmse_fc_db, y = cross_val_firm_out.rmse_fc)
x = np.linspace(*ax.get_xlim())
ax.plot(x, x, color = 'r')
ax.figure.set_size_inches(10, 10)

plt.suptitle('Bloomberg EPS RMSE vs. Naive EPS RMSE', size = 17, y = .93)
plt.title('without Outlying Firms', size = 15)
plt.xlabel('Naive EPS RMSE', size = 13)
plt.ylabel('Bloomberg EPS RMSE', size = 13)
plt.savefig(PATH_BIVARIATE + 'rmse-blmbrg-naive-no-outliers.png')
plt.show();

Observation 1: My initial observation was correct: there are, in fact, more Bloomberg EPS RMSE values that are greater than my Naive EPS RMSE values per firm. This means that for the bulk majority of the data, my naive forecasts were much more reliable than the Bloomberg EPS forecasts. This is a major takeaway.

Observation 2: Of all the RMSE values, 86.95% of my Naive EPS RMSE values are less than the Bloomberg EPS RMSE values. For each firm, I subtracted their Naive RMSE from their respective Bloomberg RMSE. Any difference above 0 indicates that the Bloomberg RMSE was much larger. This percentage is consistent with the visual I produced.

In [ ]:
((cross_val_firm['diff'] > 0).sum()) / cross_val_firm.shape[0]

Prediction Errors (naive vs. Bloomberg forecasts)

In [ ]:
naive_blmbrg = features_naive_eps[['firm_id', 'term', 'value', 'eps_fc_value', 'naive_prediction']]
In [ ]:
naive_blmbrg = naive_blmbrg.rename(columns = {'value' : 'eps_act',
                              'eps_fc_value' : 'eps_fc',
                              'naive_prediction' : 'naive_fc'})
In [ ]:
naive_blmbrg.head()
In [ ]:
#add raw prediction errors
naive_blmbrg['raw_pred_err'] = naive_blmbrg['eps_act'] - naive_blmbrg['eps_fc']
In [ ]:
naive_blmbrg['naive_pred_err'] = naive_blmbrg['eps_act'] - naive_blmbrg['naive_fc']
In [ ]:
naive_blmbrg.head()
In [ ]:
#melt for easier plotting
naive_blmbrg = pd.melt(naive_blmbrg, id_vars = ['firm_id', 'term', 'eps_act', 'eps_fc', 'naive_fc'], value_vars = ['raw_pred_err', 'naive_pred_err'],
       var_name = 'err_type', value_name = 'err_value')
In [ ]:
plt.figure(figsize = [20, 15])
ax = sb.pointplot(data = naive_blmbrg, x = 'term', y = 'err_value', hue = 'err_type', errwidth = 0.5, scale = .65)

plt.suptitle('Average Raw Prediction Error Types by Term', size = 20, y = .93)
plt.title('across All Firms', size = 17)
plt.legend(title = 'Error Type')
leg_handles = ax.get_legend_handles_labels()[0]
ax.legend(leg_handles, ['Bloomberg Prediction Error', 'Naive Prediction Error'], prop = {'size' : 17})
plt.xticks(rotation = 'vertical')
plt.xlabel('Term')
plt.ylabel('Error Value')

plt.savefig(PATH_MULTIVARIATE + 'blmbrg-naive-pred-term.png')
plt.show();

Observation 1: There are 2 terms where Bloomberg prediction errors are notable: 2008Q4 and 2000Q4. Bloomberg forecasters were overly optimistic in 2008Q4, while overly pessimistic in 2000Q4 (on average).

Observation 2: There are 4 terms where my naive prediction errors are notable: 2000Q4, 2001Q1, and 2008Q4, 2009Q1. My naive forecasts were overly optimistic in 2001Q1 and 2008Q4 (the latter being consistent with the Bloomberg prediction errors). My naive forecasts were overly pessimistic in 2000Q4 and 2001Q1 (the former being consistent with the Bloomberg prediction errors).

Observation 3: From 2012Q3 onwards, my average naive forecasts stay closer to 0 while the Bloomberg forecasts tend to be lesser than the naive forecasts: veering away from 0. This means that in the more recent years since 2012, Bloomberg EPS forecasts were less accurate than my own naive forecasts.

Observation 4: My average naive forecasts and Bloomberg forecasts both roughly "equalize", following the same trend "clustering" around 0, 2003Q4 to 2007Q4.

Observation 5: Look at 2000Q4 and 2001Q4. In the former, my naive forecasts tend to be overly pessimistic, but in the quarter immediately after that, they make a complete reverse, and tend to become overly optimistic.

Observation 6: The previous pattern is reversed for 2008Q4 and 2009Q1. In the former, my naive forecasts tend to be overly optimistic, but in the quarterly immediately after, they tend to be become overly pessimistic.

Observation 7: Expanding on the previous 2 observations, it seems that my naive prediction errors change "mood" when entering a new year from 2000 to 2001 and 2008 to 2009. I'd like to take special note that my naive forecasts spiked in pessimism in 2009Q1: only right before the 2009 Recession began to recover. I also find it curious that Bloomberg forecasters were overly optimistic in their EPS forecasts during the harder years of the bear market.

In [ ]:
plt.figure(figsize = [20, 15])
ax = sb.pointplot(data = naive_blmbrg, x = naive_blmbrg.term.dt.year, y = 'err_value', hue = 'err_type', errwidth = 0.5, scale = .65)

plt.suptitle('Average Raw Prediction Error Types by Year', size = 20, y = .93)
plt.title('across All Firms', size = 17)
plt.legend(title = 'Error Type')
leg_handles = ax.get_legend_handles_labels()[0]
ax.legend(leg_handles, ['Bloomberg Prediction Error', 'Naive Prediction Error'], prop = {'size' : 17})
plt.xticks(rotation = 'vertical')
plt.xlabel('Year')
plt.ylabel('Error Value')

plt.savefig(PATH_MULTIVARIATE + 'blmbrg-naive-pred-year.png')
plt.show();

Observation 1: The most notable outliers are consistent with the previous graph. Bloomberg forecasters were overly optimistic in 2008, while overly pessimistic in 2000. In both cases, my average naive forecasts tended to be a better fit than the Bloomberg EPS forecasts, since they all tend to cluster around 0 while the Bloomberg ones veer far away from 0.

Observation 2: Overall, my naive forecasts tend to be more accurate and beter fitting than Bloomberg forecasts across all years. The only years where Bloomberg EPS forecasts are a better fit are in the years 1999, 2001, 2006, and 2007, where my naive forecasts veer farther away from 0.

In [ ]:
plt.figure(figsize = [20, 15])
ax = sb.pointplot(data = naive_blmbrg, x = naive_blmbrg.term.dt.quarter, y = 'err_value', hue = 'err_type', errwidth = 0.5, scale = .65)

plt.suptitle('Average Raw Prediction Error Types by Quarter', size = 20, y = .93)
plt.title('across All Firms', size = 17)
plt.legend(title = 'Error Type')
leg_handles = ax.get_legend_handles_labels()[0]
ax.legend(leg_handles, ['Bloomberg Prediction Error', 'Naive Prediction Error'], prop = {'size' : 17})
plt.xlabel('Quarter', size = 17)
plt.xticks(fontsize = 15)
plt.ylabel('Error Value', size = 17)

plt.savefig(PATH_MULTIVARIATE + 'blmbrg-naive-pred-quarter.png')
plt.show();

Observation 1: Across all quarters, my naive forecasts have always tended to be a better fit than the Bloomberg EPS forecasts. No matter what quarter, my naive forecast is always more reliable. This is another key takeaway.

Observation 2: Bloomberg EPS forecasts tend to be more optimistic in Q2 of any year, while my naive forecasts tend to be more pessimistic in Q2 of any year.

Observation 3: In Q4 across all years, both forecast types tended to be more optimistic. This makes intuitive sense, as I would assume that as familiarity with the trends of the current year builds up, the more confident Bloomberg forecasters would be in their EPS predictions.

Observation 4: These results are consistent with the previous visual, where Bloomberg forecasts have been overly optimistic during 2000Q4 and 2008Q4: both outliers occurring during the same quarter. How would these average prediction errors look like on a Time Series plot, after removing outlying prediction errors during those 2 terms? How would the quarterly data time series be affected? (future work)

More Raw Prediction Errors

In [ ]:
random_ids = naive_blmbrg.sample(n = 5, random_state = 4).firm_id.values
In [ ]:
naive_blmbrg_lim = naive_blmbrg[naive_blmbrg.firm_id.isin(random_ids)]
In [ ]:
naive_blmbrg_lim['firm_ticker'] = convert_ids_to_tickers(naive_blmbrg_lim.firm_id)
In [ ]:
naive_blmbrg_lim_pred = naive_blmbrg_lim[naive_blmbrg_lim['err_type'] == 'raw_pred_err']
In [ ]:
plt.figure(figsize = [20, 15])
sb.pointplot(data = naive_blmbrg_lim_pred, x = 'term', y = 'err_value', hue = 'firm_ticker', scale = .5, errwidth = 0.5)
plt.xticks(rotation = '90')

plt.suptitle('Average Raw Bloomberg EPS Prediction Errors by Term', size = 20, y = .93)
plt.title('for 5 random firms', size = 17)
plt.legend(title = 'Firm Ticker', fontsize = 'large', markerscale = 2)
leg_handles = ax.get_legend_handles_labels()[0]
plt.xlabel('Term', size = 17)
plt.ylabel('Average Bloomberg EPS Prediction Error', size = 17)

plt.savefig(PATH_MULTIVARIATE + 'blmbrg-pred-error-term.png')
plt.show();

Observation 1: To summarize with three words: for any 5 random firms, average Bloomberg EPS prediction error trends are scattered, inconsistent, and erratic.

Observation 2: Here, no outliers appear during 2000Q4 or 2008Q4. This is a key takeaway, since a few outlying prediction errors DO affect the overall prediction error trends at these 2 terms.

Observation 3: One notable outlier occurs at 2017Q4, where 3 random firms consistently show an average overly optimistic EPS forecast from Bloomberg.: GM, NWS, and TWTR.

Observation 4: Another notable outlier occurs at 2016Q1, where BAX is an outlying firm showcasing overly pessimistic EPS forecasts from Bloomberg.

In [ ]:
naive_blmbrg_lim_naive = naive_blmbrg_lim[naive_blmbrg_lim['err_type'] == 'naive_pred_err']
In [ ]:
plt.figure(figsize = [20, 15])
sb.pointplot(data = naive_blmbrg_lim_naive, x = 'term', y = 'err_value', hue = 'firm_ticker', scale = .5, errwidth = 0.5)
plt.xticks(rotation = '90')

plt.suptitle('Average Raw Naive EPS Prediction Errors by Term', size = 20, y = .93)
plt.title('for 5 random firms', size = 17)
plt.legend(title = 'Firm Ticker', fontsize = 'large', markerscale = 2)
plt.xlabel('Term', size = 17)
plt.ylabel('Average Naive EPS Prediction Error', size = 17)

plt.savefig(PATH_MULTIVARIATE + 'naive-pred-error-term.png')
plt.show();

Observation 1: Unlike the previous graph, my naive forecasts depicting the same 5 random firms is much more consistent and better fitting since all average prediction errors tend to cluster around 0.

Observation 2: There are only 2 notable outliers among these naive forecasts: at 2009Q4 and 2009Q2. My naive forecasters were overly optimistic for the former, and overly pessimistic for the latter. These 2 outliers represent 1 firm: GM.

Observation 3: For Bloomberg forecasts, GM is actually a better fit than my naive forecasts.

Observation 4: This graph depicts a much larger scale; the Bloomberg one ranges from -6 to 6 on the y-axis, while my naive one ranges from -150 to 170 on the y-axis. This means that for the 5 random firms selected, my naive forecasts tended to be a better fit. But whenever an outlying firm is depicted, those outlying values stray further from 0, which means my naive forecasts were a worse fit. To summarize: my naive forecasts are usually more reliable than the Bloomberg forecasts, ONLY when outliers are not present. (mention zeroing in on outliers)

Conclusions

Research Questions

Question 1: Does average EPS prediction error depict any differences in trends whether by a yearly, quarterly, or full-term basis?

  1. According to figure (features-act-fc-diffs-term), forecasters were most optimistic in 2008, Quarter 4, and most pessimistic in 2000, Quarter 4.

  2. According to figure (features-act-fc-diffs-year), forecasters were most pessimistic in the year 2000, and most optimistic in the year 2008.

  3. The year 2000 coincidentally shows one of the widest variances in average prediction errors (ranging from 0 to 0.5).

  4. The year 2008 also shows one of the widest variances in average prediction errors ranging from -0.25 until -1.5.

  • The two years where forecasters made overly pessimistic/optimistic predictions also coincidentally happened to contain the highest variance, and therefore, the forecasts are very spread out from the mean.
  1. According to figure (features-act-fc-diffs-quarter.png), the later the quarter, the more optimistic forecasters become in their predictions. This intuitively makes sense; as the familiarity of the year increases and patterns become established, the more I conjecture that forecasters would become confident in their EPS predictions.

  2. There is much more variance among quarterly EPS prediction errors than by a yearly or termly basis

  3. The overall trend between the first two figures depicting average EPS prediction error by quarter and year is consistent. When ignoring the pessimistic and optimistic outliers, there is no slope; all average EPS forecasts gather around 0 by the year.

  4. It is only by a quarterly basis that we see a predictable pattern emerge, where outliers (usually optimistic) happen exclusively during Q4.

  5. When comparing the quarterly trends to the full-term trends, indeed, the 2 outliers occur solely in Quarter 4 of 2000 and 2008. Therefore, our yearly and quarterly figures display congruent trends & patterns with the figure depicting full-terms.

Question 2: I generate naive EPS forecasts by calculating the rolling mean of the 2 actual EPS values from the past 2 quarters. How do my EPS forecasts compare to Bloomberg's EPS forecasts?

  1. My naive forecasts tended to be overly optimistic in 2001Q1 and 2000Q4 (the latter being consistent with the Bloomberg forecasts). Meanwhile, my naive forecasts tended to be overly pessimistic in 2000Q4 and 2001Q1.

  2. In the more recent years since 2012, Bloomberg EPS forecasts were less accurate than my own naive forecasts.

  3. My naive forecasts spiked in pessimism during 2009Q1: only right before the Great Recession began to recover in June of 2009. Meanwhile, Bloomberg forecasters tended to be overly optimistic in their EPS forecasts during the "harder" years of the bear market.

  4. Overall, my naive forecasts tend to be more accurate and better-fitting than Bloomberg forecasts across all years.

  5. Across all quarters, my naive forecasts have always tended to be a better fit than the Bloomberg EPS forecasts.

Question 3: What differences and similarities emerge when analyzing the prediction error and percentage error of EPS forecasts?

  1. For prediction errors, forecasters, on average, are likely to be more inaccurate in their predictions in Q4 of any given year. This is depicted in figure (features-perror-quarter-firm), where the top 5 most inaccurate firms by prediction error display relatively low average prediction errors until Q4, where both LRCX and AIG "branch off" into opposite directions on the x-axis.

  2. For percentage errors, EPS forecasts have become more inaccurate in the more recent terms, starting from 2014Q3. Outliers only started to appear in the trend depicted through (pct-tirm-term.png) starting in the year 2014. This conclusion is consistent with that observations made in the quarterly and yearly stripplots: (pct-firm-quarter) and (pct-firm-quarter-top-strip) for quarterly data, and (pct-firm-year) and (pct-firm-year-top-strip) for yearly data.

  3. The top 5 most inaccurate firms in terms of absolute prediction error are AGN, AIG, CHTR, LRCX, and VRSN. Respectively, these firm tickers stand for Allergan plc, American International Group Inc, Charter Communications Inc, Lam Research Corporation, and Verisign.

  4. The top 5 most inaccurate firms in terms of absolute percentage error are IBM, IRM, MCK, PXD, and QRVO. Respectively, these firm tickers stand for IBM Common Stock, Iron Mountain Inc, McKesson Corporation, Pioneer Natural Resources, and Qorvo Inc.

  5. For prediction error, the most notable outlier is AIG. This firm is, on average, the only outlier in Q4 of any given year.

  6. For percentage error, the most notable outlier is IRM, with a percentage error of over -1000 for EPS in the term 2014Q4. Outliers only started appearing starting in the year 2014

Question 4: How do my naive RMSE values compare to Bloomberg RMSE when accounting for EPS forecasts? For what percentage of firms does my naive RMSE beat Bloomberg’s RMSE?

  1. There are, in fact, more Bloomberg EPS RMSE values that are greater than my Naive EPS RMSE values per firm. This means that for the bulk majority of the data, my naive forecasts were much more reliable than the Bloomberg EPS forecasts.

  2. Of all the RMSE values, 86.95% of my Naive EPS RMSE values are less than the Bloomberg EPS RMSE values.

  3. After sorting each firm's Bloomberg and naive RMSE in decreasing order, the top 2 most inaccurately-predicted firm tickers are AIG and LRCX.

Question 5: How do EOD Prices trend across all firms from 1999 - 2019?

  1. EOD prices show a general positive trend from 2009, as can be seen with figure (features-eod-term).

  2. EOD prices see a "trough" from 2007 - 2009: the exact years of the Great Recession.

  3. All patterns shown in the Actual vs. Forecasted pointplots are consistent with the stripplot (features-feature-value) at the beginning of the bivariate stage. Actual EPS has the most outliers, and there a few "spikes" in the pointplots where actual EPS significantly veers away from the forecasted EPS trend line.

FUTURE WORK

  • Properly implement regression tests
  • Acknowledge multicollinearity
  • Make more use of eps_fc_terms
  • Look at other forecasts made from different periods, not just 3 months before. This would give me a greater insight into forecasting patterns made at various points in time before the current fiscal period.
  • Look into how average prediction errors would look like, after removing outlier values at 2000Q4 and 2008Q4.
In [ ]:
#convert notebook to HTML
from subprocess import call
call(['python', '-m', 'nbconvert', 'data_exploratory.ipynb'])